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PREFACE TO THE 
FOURTH EDITION 



This fourth edition, completed in 2007, contains new examples, 130 new figures, and output from five 
computer software packages representative of the hundreds or perhaps thousands of computer software 
packages that are available for use in statistics. All figures in the third edition are replaced by new and 
sometimes different figures created by the five software packages: EXCEL, MINITAB, SAS, SPSS, and 
STATISTIX. My examples were greatly influenced by USA Today as I put the book together. This 
newspaper is a great source of current uses and examples in statistics. 

Other changes in the book include the following. Chapter 18 on the analyses of time series was 
removed and Chapter 19 on statistical process control and process capability is now Chapter 18. The 
answers to the supplementary problems at the end of each chapter are given in more detail. More 
discussion and use of ^-values is included throughout the book. 



ACKNOWLEDGMENTS 

As the statistical software plays such an important role in the book, I would like to thank the 
following people and companies for the right to use their software. 

MINITAB: Ms. Laura Brown, Coordinator of the Author Assistance Program, Minitab Inc., 1829 Pine 
Hall Road, State College, PA 16801. I am a member of the author assistance program that Minitab 
sponsors. "Portions of the input and output contained in this publication/book are printed with the 
permission of Minitab Inc. All material remains the exclusive property and copyright of Minitab Inc. All 
rights reserved." The web address for Minitab is www.minitab.com. 

SAS: Ms. Sandy Varner, Marketing Operations Manager, SAS Publishing, Cary, NC. "Created with 
SAS software. Copyright 2006. SAS Institute Inc., Cary, NC, USA. All rights reserved. Reproduced with 
permission of SAS Institute Inc., Cary, NC." I quote from the Website: "SAS is the leader in business 
intelligence software and services. Over its 30 years, SAS has grown - from seven employees to nearly 
10,000 worldwide, from a few customer sites to more than 40,000 - and has been profitable every year." 
The web address for SAS is www.sas.com. 

SPSS: Ms. Jill Rietema, Account Manager, Publications, SPSS. I quote from the Website: "SPSS Inc. 
is a leading worldwide provider of predictive analytics software and solutions. Founded in 1968, 
today SPSS has more than 250,000 customers worldwide, served by more than 1200 employees in 
60 countries." The Web address for SPSS is www.spss.com. 

STATISTIX: Dr. Gerard Nimis, President, Analytical Software, P.O. Box 12185, Tallahassee, FL 32317. 
I quote from the Website: "If you have data to analyze, but you're a researcher, not a statistician. 
Statistix is designed for you. You'll be up and running in minutes-without programming or using a 
manual. This easy to learn and simple to use software saves you valuable time and money. Statistix 
combines all the basic and advanced statistics and powerful data manipulation tools you need in a single, 
inexpensive package." The Web address for STATISTIX is www.statistix.com. 

EXCEL: Microsoft Excel has been around since 1985. It is available to almost all college students. It is 
widely used in this book. 
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PREFACE TO THE FOURTH EDITION 



I wish to thank Stanley Wileman for the computer advice that he so unselfishly gave to me during 
the writing of the book. I wish to thank my wife, Lana, for her understanding while I was often day 
dreaming and thinking about the best way to present some concept. Thanks to Chuck Wall, Senior 
Acquisitions Editor, and his staff at McGraw-Hill. Finally, I would like to thank Jeremy Toynbee, 
project manager at Keyword Publishing Services Ltd., London, England, and John Omiston, freelance 
copy editor, for their fine production work. I invite comments and questions about the book at 
Lstephens@mail.unomaha.edu. 



Larry J. Stephens 



PREFACE TO THE 
THIRD EDITION 



In preparing this third edition of Schaum 's Outline of Statistics, I have replaced dated problems with 
problems that reflect the technological and sociological changes that have occurred since the first edition 
was published in 1961. One problem in the second edition dealt with the lifetimes of radio tubes, for 
example. Since most people under 30 probably do not know what radio tubes are, this problem, as well 
as many other problems, have been replaced by problems involving current topics such as health care 
issues, AIDS, the Internet, cellular phones, and so forth. The mathematical and statistical aspects have 
not changed, but the areas of application and the computational aspects of statistics have changed. 

Another important improvement is the introduction of statistical software into the text. The devel- 
opment of statistical software packages such as SAS, SPSS, and MINITAB has dramatically changed the 
application of statistics to real-world problems. One of the most widely used statistical packages in 
academia as well as in industrial settings is the package called MINITAB (Minitab Inc., 1829 Pine 
Hall Road, State College, PA 16801). I would like to thank Minitab Inc. for granting me permission 
to include Minitab output throughout the text. Many modern statistical textbooks include computer 
software output as part of the text. I have chosen to include Minitab because it is widely used and is very 
friendly. Once a student learns the various data file structures needed to use MINITAB, and the 
structure of the commands and subcommands, this knowledge is readily transferable to other statistical 
software. With the introduction of pull-down menus and dialog boxes, the software has been made even 
friendlier. I include both commands and pull-down menus in the Minitab discussions in the text. 

Many of the new problems discuss the important statistical concept of the ^-value for a statistical 
test. When the first edition was introduced in 1961, the /rvalue was not as widely used as it is today, 
because it is often difficult to determine without the aid of computer software. Today ^-values are 
routinely provided by statistical software packages since the computer software computation of ^-values 
is often a trivial matter. 

A new chapter entitled "Statistical Process Control and Process Capability" has replaced Chapter 
19, "Index Numbers." These topics have many industrial applications, and I felt they needed to be 
included in the text. The inclusion of the techniques of statistical process control and process capability 
in modern software packages has facilitated the implementation of these techniques in many industrial 
settings. The software performs all the computations, which are rather burdensome. I chose to use 
Minitab because I feel that it is among the best software for SPC applications. 

I wish to thank my wife Lana for her understanding during the preparation of the book; my friend 
Stanley Wileman for all the computer help he has given me; and Alan Hunt and the staff at Keyword 
Publishing Services Ltd., London, England, for their fine production work. Finally, I wish to thank the 
staff at McGraw-Hill for their cooperation and helpfulness. 

Larry J. Stephens 
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PREFACE TO THE 
SECOND EDITION 



Statistics, or statistical methods as it is sometimes called, is playing an increasingly important role in 
nearly all phases of human endeavor. Formerly dealing only with affairs of the state, thus accounting for 
its name, the influence of statistics has now spread to agriculture, biology, business, chemistry, commu- 
nications, economics, education, electronics, medicine, physics, political science, psychology, sociology 
and numerous other fields of science and engineering. 

The purpose of this book is to present an introduction to the general statistical principles which will 
be found useful to all individuals regardless of their fields of specialization. It has been designed for use 
either as a supplement to all current standard texts or as a textbook for a formal course in statistics. 
It should also be of considerable value as a book of reference for those presently engaged in applications 
of statistics to their own special problems of research. 

Each chapter begins with clear statements of pertinent definitions, theorems and principles together 
with illustrative and other descriptive material. This is followed by graded sets of solved and supple- 
mentary problems which in many instances use data drawn from actual statistical situations. The solved 
problems serve to illustrate and amplify the theory, bring into sharp focus those fine points without 
which the students continually feel themselves to be on unsafe ground, and provide the repetition of 
basic principles so vital to effective teaching. Numerous derivations of formulas are included among the 
solved problems. The large number of supplementary problems with answers serve as a complete review 
of the material of each chapter. 

The only mathematical background needed for an understanding of the entire book is arithmetic and 
the elements of algebra. A review of important mathematical concepts used in the book is presented in 
the first chapter which may either be read at the beginning of the course or referred to later as the need 

The early part of the book deals with the analysis of frequency distributions and associated measures 
of central tendency, dispersion, skewness and kurtosis. This leads quite naturally to a discussion of 
elementary probability theory and applications, which paves the way for a study of sampling theory. 
Techniques of large sampling theory, which involve the normal distribution, and applications to statis- 
tical estimation and tests of hypotheses and significance are treated first. Small sampling theory, invol- 
ving Student's t distribution, the chi-square distribution and the F distribution together with the 
applications appear in a later chapter. Another chapter on curve fitting and the method of least squares 
leads logically to the topics of correlation and regression involving two variables. Multiple and partial 
correlation involving more than two variables are treated in a separate chapter. These are followed by 
chapters on the analysis of variance and nonparametric methods, new in this second edition. Two final 
chapters deal with the analysis of time series and index numbers respectively. 

Considerably more material has been included here than can be covered in most first courses. This 
has been done to make the book more flexible, to provide a more useful book of reference and to 
stimulate further interest in the topics. In using the book it is possible to change the order of many 
later chapters or even to omit certain chapters without difficulty. For example, Chapters 13-15 and 
18-19 can, for the most part, be introduced immediately after Chapter 5, if it is desired to treat correla- 
tion, regression, times series, and index numbers before sampling theory. Similarly, most of Chapter 6 
may be omitted if one does not wish to devote too much time to probability. In a first course all of 
Chapter 15 may be omitted. The present order has been used because there is an increasing tendency in 
modern courses to introduce sampling theory and statistical influence as early as possible. 

I wish to thank the various agencies, both governmental and private, for their cooperation in supply- 
ing data for tables. Appropriate references to such sources are given throughout the book. In particular, 
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I am indebted to Professor Sir Ronald A. Fisher, F.R.S., Cambridge; Dr. Frank Yates, F.R.S., 
Rothamsted; and Messrs. Oliver and Boyd Ltd., Edinburgh, for permission to use data from Table 
III of their book Statistical Tables for Biological, Agricultural, and Medical Research. I also wish to thank 
Esther and Meyer Scher for their encouragement and the staff of McGraw-Hill for their cooperation. 



Murray R. Spiegel 



For more information about this title, click here 



CONTENTS 



CHAPTER 1 Variables and Graphs 1 

Statistics 1 

Population and Sample; Inductive and Descriptive Statistics 1 

Variables: Discrete and Continuous 1 

Rounding of Data 2 

Scientific Notation 2 

Significant Figures 3 

Computations 3 

Functions 4 

Rectangular Coordinates 4 

Graphs 4 

Equations 5 

Inequalities 5 

Logarithms 6 

Properties of Logarithms 7 

Logarithmic Equations 7 

CHAPTER 2 Frequency Distributions 37 

Raw Data 37 

Arrays 37 

Frequency Distributions 37 

Class Intervals and Class Limits 38 

Class Boundaries 38 

The Size, or Width, of a Class Interval 38 

The Class Mark 38 

General Rules for Forming Frequency Distributions 38 

Histograms and Frequency Polygons 39 

Relative-Frequency Distributions 39 

Cumulative-Frequency Distributions and Ogives 40 
Relative Cumulative-Frequency Distributions and 

Percentage Ogives 40 

Frequency Curves and Smoothed Ogives 41 

Types of Frequency Curves 41 

CHAPTER 3 The Mean, Median, Mode, and Other Measures of 

Central Tendency 61 

Index, or Subscript, Notation 61 

Summation Notation 61 

Averages, or Measures of Central Tendency 62 



xiii 



xiv 



CONTENTS 



The Arithmetic Mean 62 

The Weighted Arithmetic Mean 62 

Properties of the Arithmetic Mean 63 

The Arithmetic Mean Computed from Grouped Data 63 

The Median 64 

The Mode 64 
The Empirical Relation Between the Mean, Median, and Mode 64 

The Geometric Mean G 65 

The Harmonic Mean H 65 
The Relation Between the Arithmetic, Geometric, and Harmonic 

Means 66 

The Root Mean Square 66 

Quartiles, Deciles, and Percentiles 66 

Software and Measures of Central Tendency 67 

CHAPTER 4 The Standard Deviation and Other Measures of 

Dispersion 95 

Dispersion, or Variation 95 

The Range 95 

The Mean Deviation 95 

The Semi-Interquartile Range 96 

The 10-90 Percentile Range 96 

The Standard Deviation 96 

The Variance 97 

Short Methods for Computing the Standard Deviation 97 

Properties of the Standard Deviation 98 

Charlier's Check 99 

Sheppard's Correction for Variance 100 

Empirical Relations Between Measures of Dispersion 100 

Absolute and Relative Dispersion; Coefficient of Variation 100 

Standardized Variable; Standard Scores 101 

Software and Measures of Dispersion 101 

CHAPTER 5 Moments, Skewness, and Kurtosis 123 

Moments 123 

Moments for Grouped Data 123 

Relations Between Moments 124 

Computation of Moments for Grouped Data 124 

Charlier's Check and Sheppard's Corrections 124 

Moments in Dimensionless Form 124 

Skewness 125 

Kurtosis 125 

Population Moments, Skewness, and Kurtosis 126 

Software Computation of Skewness and Kurtosis 126 



CONTENTS xv 



CHAPTER 6 Elementary Probability Theory 139 

Definitions of Probability 139 
Conditional Probability; Independent and 

Dependent Events 140 

Mutually Exclusive Events 141 

Probability Distributions 142 

Mathematical Expectation 144 

Relation Between Population, Sample Mean, and Variance 144 

Combinatorial Analysis 145 

Combinations 146 

Stirling's Approximation to n\ 146 

Relation of Probability to Point Set Theory 146 

Euler or Venn Diagrams and Probability 146 

CHAPTER 7 The Binomial, Normal, and Poisson Distributions 172 

The Binomial Distribution 172 

The Normal Distribution 173 

Relation Between the Binomial and Normal Distributions 174 

The Poisson Distribution 175 

Relation Between the Binomial and Poisson Distributions 176 

The Multinomial Distribution 177 
Fitting Theoretical Distributions to Sample Frequency 

Distributions 177 

CHAPTER 8 Elementary Sampling Theory 203 

Sampling Theory 203 

Random Samples and Random Numbers 203 

Sampling With and Without Replacement 204 

Sampling Distributions 204 

Sampling Distribution of Means 204 

Sampling Distribution of Proportions 205 

Sampling Distributions of Differences and Sums 205 

Standard Errors 207 

Software Demonstration of Elementary Sampling Theory 207 

CHAPTER 9 Statistical Estimation Theory 227 

Estimation of Parameters 227 

Unbiased Estimates 227 

Efficient Estimates 228 

Point Estimates and Interval Estimates; Their Reliability 228 

Confidence-Interval Estimates of Population Parameters 228 

Probable Error 230 



xvi 



CONTENTS 



CHAPTER 10 Statistical Decision Theory 245 

Statistical Decisions 245 

Statistical Hypotheses 245 

Tests of Hypotheses and Significance, or Decision Rules 246 

Type I and Type II Errors 246 

Level of Significance 246 

Tests Involving Normal Distributions 246 

Two-Tailed and One-Tailed Tests 247 

Special Tests 248 

Operating-Characteristic Curves; the Power of a Test 248 

/(-Values for Hypotheses Tests 248 

Control Charts 249 

Tests Involving Sample Differences 249 

Tests Involving Binomial Distributions 250 

CHAPTER 11 Small Sampling Theory 275 

Small Samples 275 

Student's t Distribution 275 

Confidence Intervals 276 

Tests of Hypotheses and Significance 277 

The Chi-Square Distribution 277 

Confidence Intervals for a 278 

Degrees of Freedom 279 

The F Distribution 279 

CHAPTER 12 The Chi-Square Test 294 

Observed and Theoretical Frequencies 294 

Definition of x 294 

Significance Tests 295 

The Chi-Square Test for Goodness of Fit 295 

Contingency Tables 296 

Yates' Correction for Continuity 297 

Simple Formulas for Computing x 297 

Coefficient of Contingency 298 

Correlation of Attributes 298 

Additive Property of x 2 299 

CHAPTER 13 Curve Fitting and the Method of Least Squares 316 

Relationship Between Variables 316 

Curve Fitting 316 

Equations of Approximating Curves 317 

Freehand Method of Curve Fitting 318 

The Straight Line 318 

The Method of Least Squares 319 



CONTENTS xvii 

The Least-Squares Line 319 

Nonlinear Relationships 320 

The Least-Squares Parabola 320 

Regression 321 

Applications to Time Series 321 

Problems Involving More Than Two Variables 321 

CHAPTER 14 Correlation Theory 345 

Correlation and Regression 345 

Linear Correlation 345 

Measures of Correlation 346 

The Least-Squares Regression Lines 346 

Standard Error of Estimate 347 

Explained and Unexplained Variation 348 

Coefficient of Correlation 348 

Remarks Concerning the Correlation Coefficient 349 
Product-Moment Formula for the Linear 

Correlation Coefficient 350 

Short Computational Formulas 350 

Regression Lines and the Linear Correlation Coefficient 351 

Correlation of Time Series 351 

Correlation of Attributes 351 

Sampling Theory of Correlation 351 

Sampling Theory of Regression 352 

CHAPTER 15 Multiple and Partial Correlation 382 

Multiple Correlation 382 

Subscript Notation 382 

Regression Equations and Regression Planes 382 

Normal Equations for the Least-Squares Regression Plane 383 

Regression Planes and Correlation Coefficients 383 

Standard Error of Estimate 384 

Coefficient of Multiple Correlation 384 

Change of Dependent Variable 384 

Generalizations to More Than Three Variables 385 

Partial Correlation 385 
Relationships Between Multiple and Partial Correlation 

Coefficients 386 

Nonlinear Multiple Regression 386 

CHAPTER 16 Analysis of Variance 403 

The Purpose of Analysis of Variance 403 

One- Way Classification, or One-Factor Experiments 403 



xviii 



CONTENTS 



Total Variation, Variation Within Treatments, and Variation 

Between Treatments 404 

Shortcut Methods for Obtaining Variations 404 

Mathematical Model for Analysis of Variance 405 

Expected Values of the Variations 405 

Distributions of the Variations 406 

The F Test for the Null Hypothesis of Equal Means 406 

Analysis-of- Variance Tables 406 

Modifications for Unequal Numbers of Observations 407 

Two-Way Classification, or Two-Factor Experiments 407 

Notation for Two-Factor Experiments 408 

Variations for Two-Factor Experiments 408 

Analysis of Variance for Two-Factor Experiments 409 

Two-Factor Experiments with Replication 410 

Experimental Design 412 

CHAPTER 17 Nonparametric tests 446 

Introduction 446 

The Sign Test 446 

The Mann-Whitney U Test 447 

The Kruskal-Wallis H Test 448 

The H Test Corrected for Ties 448 

The Runs Test for Randomness 449 

Further Applications of the Runs Test 450 

Spearman's Rank Correlation 450 

CHAPTER 18 Statistical Process Control and Process Capability 480 

General Discussion of Control Charts 480 

Variables and Attributes Control Charts 48 1 

Z-bar and R Charts 481 

Tests for Special Causes 484 

Process Capability 484 

P- and VP-Charts 487 

Other Control Charts 489 

Answers to Supplementary Problems 505 

Appendixes 559 

I Ordinates (Y) of the Standard Normal Curve at z 561 

II Areas Under the Standard Normal Curve from to z 562 

III Percentile Values (t p ) for Student's t Distribution with v Degrees of Freedom 563 

IV Percentile Values (Xp) f° r tne Chi-Square Distribution with v Degrees of 

Freedom 564 



CONTENTS xix 

V 95th Percentile Values for the F Distribution 565 

VI 99th Percentile Values for the F Distribution 566 

VII Four-Place Common Logarithms 567 

VIII Values of e~ x ' 569 

IX Random Numbers 570 

Index 571 



This page intentionally left blank 




Theory and Problems of 

STATISTICS 



This page intentionally left blank 



Variables and Graphs 



STATISTICS 

Statistics is concerned with scientific methods for collecting, organizing, summarizing, presenting, 
and analyzing data as well as with drawing valid conclusions and making reasonable decisions on the 
basis of such analysis. 

In a narrower sense, the term statistics is used to denote the data themselves or numbers derived 
from the data, such as averages. Thus we speak of employment statistics, accident statistics, etc. 

POPULATION AND SAMPLE; INDUCTIVE AND DESCRIPTIVE STATISTICS 

In collecting data concerning the characteristics of a group of individuals or objects, such as the 
heights and weights of students in a university or the numbers of defective and nondefective bolts 
produced in a factory on a given day, it is often impossible or impractical to observe the entire 
group, especially if it is large. Instead of examining the entire group, called the population, or universe, 
one examines a small part of the group, called a sample. 

A population can be finite or infinite. For example, the population consisting of all bolts produced 
in a factory on a given day is finite, whereas the population consisting of all possible outcomes (heads, 
tails) in successive tosses of a coin is infinite. 

If a sample is representative of a population, important conclusions about the population can often 
be inferred from analysis of the sample. The phase of statistics dealing with conditions under which such 
inference is valid is called inductive statistics, or statistical inference. Because such inference cannot be 
absolutely certain, the language of probability is often used in stating conclusions. 

The phase of statistics that seeks only to describe and analyze a given group without drawing any 
conclusions or inferences about a larger group is called descriptive, or deductive, statistics. 

Before proceeding with the study of statistics, let us review some important mathematical concepts. 



VARIABLES: DISCRETE AND CONTINUOUS 

A variable is a symbol, such as X, Y, H, x, or B, that can assume any of a prescribed set of values, 
called the domain of the variable. If the variable can assume only one value, it is called a constant. 

A variable that can theoretically assume any value between two given values is called a continuous 
variable; otherwise, it is called a discrete variable. 

EXAMPLE 1. The number N of children in a family, which can assume any of the values 0, 1, 2, 3, . .. but cannot 
be 2.5 or 3.842, is a discrete variable. 
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VARIABLES AND GRAPHS 



[CHAP. 1 



EXAMPLE 2. The height H of an individual, which can be 62 inches (in), 63.8 in, or 65.8341 in, depending on the 
accuracy of measurement, is a continuous variable. 

Data that can be described by a discrete or continuous variable are called discrete data or continuous 
data, respectively. The number of children in each of 1000 families is an example of discrete data, while 
the heights of 100 university students is an example of continuous data. In general, measurements give 
rise to continuous data, while enumerations, or countings, give rise to discrete data. 

It is sometimes convenient to extend the concept of variable to nonnumerical entities; for example, 
color C in a rainbow is a variable that can take on the "values" red, orange, yellow, green, blue, indigo, 
and violet. It is generally possible to replace such variables by numerical quantities; for example, denote 
red by 1, orange by 2, etc. 



ROUNDING OF DATA 

The result of rounding a number such as 72.8 to the nearest unit is 73, since 72.8 is closer to 73 than 
to 72. Similarly, 72.8146 rounded to the nearest hundredth (or to two decimal places) is 72.81, since 
72.8146 is closer to 72.81 than to 72.82. 

In rounding 72.465 to the nearest hundredth, however, we are faced with a dilemma since 72.465 is 
just as far from 72.46 as from 72.47. It has become the practice in such cases to round to the even integer 
preceding the 5. Thus 72.465 is rounded to 72.46, 183.575 is rounded to 183.58, and 1 16,500,000 rounded 
to the nearest million is 116,000,000. This practice is especially useful in minimizing cumulative rounding 
errors when a large number of operations is involved (see Problem 1.4). 



SCIENTIFIC NOTATION 

When writing numbers, especially those involving many zeros before or after the decimal point, it is 
convenient to employ the scientific notation using powers of 10. 

EXAMPLE 3. 10 1 = 10, 10 2 = 10 x 10 = 100, 10 s = 10 x 10 x 10 x 10 x 10 = 100,000, and 10 8 = 100,000,000. 
EXAMPLE 4. 10° = 1; 10" 1 = .1, or 0.1; 10~ 2 = .01, or 0.01; and 10~ 5 = .00001, or 0.00001. 
EXAMPLE 5. 864,000,000 = 8.64 x 10 8 , and 0.00003416 = 3.416 x 10~ 5 . 

Note that multiplying a number by 10 8 , for example, has the effect of moving the decimal point of 
the number eight places to the right. Multiplying a number by 10~ 6 has the effect of moving the decimal 
point of the number six places to the left. 

We often write 0.1253 rather than .1253 to emphasize the fact that a number other than zero before 
the decimal point has not accidentally been omitted. However, the zero before the decimal point can be 
omitted in cases where no confusion can result, such as in tables. 

Often we use parentheses or dots to show the multiplication of two or more numbers. Thus 
(5)(3) = 5-3 = 5x3=15, and (10)(10)(10) = 10 • 10 • 10 = 10 x 10 x 10 = 1000. When letters are 
used to represent numbers, the parentheses or dots are often omitted; for example, 
ab={a){b) = a-b = axb. 

The scientific notation is often useful in computation, especially in locating decimal points. Use is 
then made of the rules 

^Cl p 

(10'XIO') = \o p+q -^-=io p -<i 

where p and q are any numbers. 

In \0 P , p is called the exponent and 10 is called the base. 
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EXAMPLE 6. (10 3 )(10 2 ) = 1000 x 100 = 100,000 = 10 5 i.e., 10 3+2 

10 6 1,000,000 , . , , 4 

T^ = ^ooo-= 100=10 1 - e -' 10 

EXAMPLE 7. (4,000,000) (0.0000000002) = (4 x 10 6 )(2 x lO" 10 ) = (4)(2) (10 6 ) ( lO" 10 ) = 8 x lO 6 " 10 
= 8 x 10~ 4 = 0.0008 

EXAMPLE 8. (0.006)(80,000) _ (6 x 10~ 3 )(8 x 10 4 ) _ 48 x 10 1 _ /48\ lQ i-(-2) 

0.04 ~ 4 x 10- 2 ~ 4 x 10- 2 ~ V 4 y X 
= 12 x 10 3 = 12,000 



SIGNIFICANT FIGURES 

If a height is accurately recorded as 65.4 in, it means that the true height lies between 65.35 and 
65.45 in. The accurate digits, apart from zeros needed to locate the decimal point, are called the 
significant digits, or significant figures, of the number. 



EXAMPLE 9. 65.4 has three significant figures. 

EXAMPLE 10. 4.5300 has five significant figures. 

EXAMPLE 11. .0018 = 0.0018 = 1.8 x 10~ 3 has two significant figures. 

EXAMPLE 12. .001800 = 0.001800 = 1.800 x 10~ 3 has four significant figures. 



Numbers associated with enumerations (or countings), as opposed to measurements, are of course 
exact and so have an unlimited number of significant figures. In some of these cases, however, it may be 
difficult to decide which figures are significant without further information. For example, the number 
186,000,000 may have 3, 4, . . . , 9 significant figures. If it is known to have five significant figures, it would 
be better to record the number as either 186.00 million or 1.8600 x 10 8 . 



COMPUTATIONS 

In performing calculations involving multiplication, division, and the extraction of roots of num- 
bers, the final result can have no more significant figures than the numbers with the fewest significant 
figures (see Problem 1.9). 

EXAMPLE 13. 73.24 x 4.52 = (73.24)(4.52) = 331 

EXAMPLE 14. 1.648/0.023 = 72 

EXAMPLE 15. v / 38?7 = 6.22 

EXAMPLE 16. (8.416)(50) = 420.8 (if 50 is exact) 

In performing additions and subtractions of numbers, the final result can have no more significant 
figures after the decimal point than the numbers with the fewest significant figures after the decimal point 
(see Problem 1.10). 



EXAMPLE 17. 3.16 + 2.7 = 5.9 
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EXAMPLE 18. 83.42 - 72 = 11 

EXAMPLE 19. 47.816 - 25 = 22.816 (if 25 is exact) 

The above rule for addition and subtraction can be extended (see Problem 1.11). 



FUNCTIONS 

If to each value that a variable X can assume there corresponds one or more values of a variable Y, 
we say that Y is a. function of X and write Y = F(X) (read " Y equals F of X") to indicate this functional 
dependence. Other letters (G, <j), etc.) can be used instead of F. 

The variable X is called the independent variable and Y is called the dependent variable. 

If only one value of Y corresponds to each value of X, we call Y a single-valued function of X; 
otherwise, it is called a multiple-valued function of X. 

EXAMPLE 20. The total population P of the United States is a function of the time t, and we write P = F(t). 

EXAMPLE 21. The stretch S of a vertical spring is a function of the weight W placed on the end of the spring. 
In symbols, S=G(W). 

The functional dependence (or correspondence) between variables is often depicted in a table. 
However, it can also be indicated by an equation connecting the variables, such as Y = 2X - 3, from 
which Y can be determined corresponding to various values of X. 

If Y = F(X), it is customary to let F(3) denote "the value of Y when X = 3," to let F(10) denote 
"the value of Y when X = 10," etc. Thus if Y = F(X) = X 2 , then F(3) = 3 2 = 9 is the value of Y when 
X=3. 

The concept of function can be extended to two or more variables (see Problem 1.17). 



RECTANGULAR COORDINATES 

Fig. 1-1 shows an EXCEL scatter plot for four points. The scatter plot is made up of two mutually 
perpendicular lines called the X and Y axes. The X axis is horizontal and the Y axis is vertical. The two 
axes meet at a point called the origin. The two lines divide the XY plane into four regions denoted by I, II, 
III, and IV and called the first, second, third, and fourth quadrants. Four points are shown plotted in 
Fig. 1-1. The point (2, 3) is in the first quadrant and is plotted by going 2 units to the right along the 
Zaxis from the origin and 3 units up from there. The point (-2.3, 4.5) is in the second quadrant and is 
plotted by going 2.3 units to the left along the X axis from the origin and then 4.5 units up from there. 
The point (-4, - 3) is in the third quadrant and is plotted by going 4 units to the left of the origin along 
the X axis and then 3 units down from there. The point (3.5, - 4) is in the fourth quadrant and is plotted 
by going 3.5 units to the right along the X axis and then 4 units down from there. The first number 
in a pair is called the abscissa of the point and the second number is called the ordinate of the point. 
The abscissa and the ordinate taken together are called the coordinates of the point. 

By constructing a Z axis through the origin and perpendicular to the XY plane, we can easily extend 
the above ideas. In such cases the coordinates of a point would be denoted by (X, Y, Z). 



GRAPHS 

A graph is a pictorial presentation of the relationship between variables. Many types of graphs are 
employed in statistics, depending on the nature of the data involved and the purpose for which the graph 
is intended. Among these are bar graphs, pie graphs, pictographs, etc. These graphs are sometimes 
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Fig. 1-1 An EXCEL plot of points in the four quadrants. 

referred to as charts or diagrams. Thus we speak of bar charts, pie diagrams, etc. (see Problems 1.23, 
1.24, 1.25, 1.26, and 1.27). 



EQUATIONS 

Equations are statements of the form A = B, where A is called the left-hand member (or side) of the 
equation, and B the right-hand member (or side). So long as we apply the same operations to both 
members of an equation, we obtain equivalent equations. Thus we can add, subtract, multiply, or divide 
both members of an equation by the same value and obtain an equivalent equation, the only exception 
being that division by zero is not allowed. 

EXAMPLE 22. Given the equation 2X + 3 = 9, subtract 3 from both members: 2X + 3-3 = 9-3, or 2X = 6. 
Divide both members by 2: 2X/2 = 6/2, or X= 3. This value of X is a solution of the given equation, as seen by 
replacing X by 3, obtaining 2(3) + 3 = 9, or 9 = 9, which is an identity. The process of obtaining solutions of 
an equation is called solving the equation. 

The above ideas can be extended to finding solutions of two equations in two unknowns, three 
equations in three unknowns, etc. Such equations are called simultaneous equations (see Problem 1.30). 



INEQUALITIES 

The symbols < and > mean "less than" and "greater than," respectively. The symbols < and > 
mean "less than or equal to" and "greater than or equal to," respectively. They are known as inequality 
symbols. 

EXAMPLE 23. 3 < 5 is read "3 is less than 5." 
EXAMPLE 24. 5 > 3 is read "5 is greater than 3." 
EXAMPLE 25. X < 8 is read "X is less than 8." 
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EXAMPLE 26. X > 10 is read "X is greater than or equal to 10." 

EXAMPLE 27. 4 < Y < 6 is read "4 is less than F, which is less than or equal to 6," or Y is between 4 and 6, 
excluding 4 but including 6," or "Y is greater than 4 and less than or equal to 6." 

Relations involving inequality symbols are called incc/ualiiics. Just as we speak of members of an 
equation, so we can speak of members of an inequality. Thus in the inequality 4 < Y < 6, the members 
are 4, 7, and 6. 

A valid inequality remains valid when: 

1 . The same number is added to or subtracted from each member. 

EXAMPLE 28. Since 15 > 12, 15 + 3 > 12 + 3 (i.e., 18 > 15) and 15 - 3 > 12 - 3 (i.e., 12 > 9). 

2. Each member is multiplied or divided by the same positive number. 
EXAMPLE 29. Since 15 > 12, (15)(3) > (12)(3) (i.e., 45 > 36) and 15/3 > 12/3 (i.e., 5 > 4). 

3. Each member is multiplied or divided by the same negative number, provided that the inequality 
symbols are reversed. 

EXAMPLE 30. Since 15 > 12, (15)(-3) < (12)(-3) (i.e., -45 < -36) and 15/(— 3) < 12/(-3) (i.e., -5 < -4). 



LOGARITHMS 

For x > 0, b > 0, and b ^ 1 , y = log A x if and only if log V = x. A logarithm is an exponent. It is the 
power that the base b must be raised to in order to get the number for which you are taking the 

logarithm. The two bases that have traditionally been used are 10 and e, which equals 2.71828182 

Logarithms with base 10 are called common logarithms and are written as log 10 x or simply as log(x). 
Logarithms with base e are called natural logarithms and are written as ln(x). 

EXAMPLE 31. Find the following logarithms and then use EXCEL to find the same logarithms: log 2 8, log 5 25, 
and log 10 1000. Three is the power of 2 that gives 8, and so log 2 8 = 3. Two is the power of 5 that gives 25, and so 
log 5 25 = 2. Three is the power of 10 that gives 1000, and so log 10 1000 = 3. EXCEL contains three functions that 
compute logarithms. The function LN computes natural logarithms, the function LOG10 computes common 
logarithms, and the function LOG(x,b) computes the logarithm of x to base h. =LOG(8,2) gives 3, =LOG(25,5) 
gives 2, and =LOG10(1000) gives 3. 

EXAMPLE 32. Use EXCEL to compute the natural logarithm of the integers 1 through 5. The numbers 1 through 
5 are entered into B1:F1 and the expression =LN(B1) is entered into B2 and a click-and-drag is performed from B2 
to F2. The following EXCEL output was obtained. 

X 1 2 3 4 5 

LN(x) 0.693147 1.098612 1.386294 1.609438 

EXAMPLE 33. Show that the answers in Example 32 are correct by showing that e ln(x) gives the value x. The 
logarithms are entered into B1:F1 and the expression e ln(x \ which is represented by =EXP(B1), is entered into B2 
and a click-and-drag is performed from B2 to F2. The following EXCEL output was obtained. The numbers in D2 
and E2 differ from 3 and 4 because of round off error. 

LN(x) 0.693147 1.098612 1.386294 1.609438 

x = EXP(LN(x)) 1 2 2.999999 3.999999 5 

Example 33 illustrates that if you have the logarithms of numbers (\og h (x)) the numbers (x) may be recovered by 
using the relation /> log ' (A '> = x. 
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EXAMPLE 34. The number e may be defined as a limiting entity. The quantity (1 + (l/x)) x gets closer and closer to 
e when x grows larger. Consider the EXCEL evaluation of (1 + (l/x))' Y for x = 1, 10, 100, 1000, 10000, 100000, and 
1000000. 

x 1 10 100 1000 10000 100000 1000000 

(1+1/x) A x 2 2.593742 2.704814 2.716924 2.718146 2.718268 2.71828 

The numbers 1, 10, 100, 1000, 10000, 100000, and 1000000 are entered into B1:H1 and the expression 
=(1+1/B1)*B1 is entered into B2 and a click-and-drag is performed from B2 to H2. This is expressed 
mathematically by the expression lim Y ^ 0C (l + (l/x)) x = e. 



EXAMPLE 35. The balance of an account earning compound interest, n times per year is given by 
A(t) = P(l + {r/n)) m where P is the principal, r is the interest rate, t is the time in years, and n is the number of 
compound periods per year. The balance of an account earning interest continuously is given by A(t) = Pe" . 
EXCEL is used to compare the continuous growth of $1000 and $1000 that is compounded quarterly after 1, 2, 
3, 4, and 5 years at interest rate 5%. The results are: 

Years 1 2 3 4 5 

Quarterly 1050.95 1104.49 1160.75 1219.89 1282.04 

Continuously 1051.27 1105.17 1161.83 1221.4 1284.03 

The times 1, 2, 3, 4, and 5 are entered into B1:F1. The EXCEL expression =1000*(1.0125) A (4*B1) is 
entered into B2 and a click-and-drag is performed from B2 to F2. The expression =1000*EXP(0.05*B1) 
is entered into B3 and a click-and-drag is performed from B3 to F3. The continuous compounding 
produces slightly better results. 



PROPERTIES OF LOGARITHMS 

The following are the more important properties of logarithms: 

1 . log,, MN = \og h M + \og h N 

2. \og h M/N = \og h M - \og h N 

3. \og h M p = p\og b M 

EXAMPLE 36. Write log fc (x>' 4 /z 3 ) as the sum or difference of logarithms of x, y, and z. 

log A ^y- = log ft xy 4 - log h z 3 property 2 

log* ^- = log,, x + log h y 4 - \og h z 3 property 1 

log A ^y- = log,, x + 4 log A y - 3 log A z property 3 



LOGARITHMIC EQUATIONS 

To solve logarithmic equations: 

1. Isolate the logarithms on one side of the equation. 

2. Express a sum or difference of logarithms as a single logarithm. 

3. Re-express the equation in step 2 in exponential form. 

4. Solve the equation in step 3. 

5. Check all solutions. 
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EXAMPLE 37. Solve the following logarithmic equation: log 4 (x+ 5) = 3. First, re-express in exponential form as 
x + 5 = 4 3 = 64. Then solve for x as follows, x = 64 - 5 = 59. Then check your solution. log 4 (59 + 5) = log 4 (64) = 3 
since 4 3 = 64. 

EXAMPLE 38. Solve the following logarithmic equation. log(6^ - 7) + logy = log(5). Replace the sum of 
logs by the log of the products. log(6>> - l)y = log(5). Now equate (6y—7)y and 5. The result is 
6y 2 - 1y = 5 or 6y 2 - ly - 5 = 0. This quadratic equation factors as (3y - 5)(2y + 1) = 0. The solutions are 
j> = 5/3 and y= -1/2. The -1/2 is rejected since it gives logs of negative numbers which are not defined. 
The y = 5/3 checks out as a solution when tried in the original equation. Therefore our only solution is y = 5/3. 

EXAMPLE 39. Solve the following logarithmic equation: 

ln(5x) - ln(4x + 2) = 4. 

Replace the difference of logs by the log of the quotient, ln(5x/(4x + 2)) = 4. Apply the definition of 
alogarithm: 5x/(4x + 2) = e 4 = 54.59815. Solving the equation 5x = 218. 39260x+ 109.19630 for x gives 
x = -0.5117. However this answer does not check in the equation ln(5x) - ln(4x + 2) = 4, since the log 
function is not denned for negative numbers. The equation ln(5x) - ln(4x + 2) = 4 has no solutions. 
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Solved Problems 



VARIABLES 

1.1 State which of the following represent discrete data and which represent continuous data: 

(a) Numbers of shares sold each day in the stock market 

(b) Temperatures recorded every half hour at a weather bureau 

(c) Lifetimes of television tubes produced by a company 

(d) Yearly incomes of college professors 

(e) Lengths of 1000 bolts produced in a factory 

SOLUTION 

(a) Discrete; (b) continuous; (c) continuous; (d) discrete; (e) continuous. 

1.2 Give the domain of each of the following variables, and state whether the variables are contin- 
uous or discrete: 

(a) Number G of gallons (gal) of water in a washing machine 

(b) Number B of books on a library shelf 

(c) Sum 5 of points obtained in tossing a pair of dice 

(d) Diameter D of a sphere 

(e) Country C in Europe 

SOLUTION 

(a) Domain: Any value from gal to the capacity of the machine. Variable: Continuous. 

(b) Domain: 0, 1, 2, 3, ... up to the largest number of books that can fit on a shelf. Variable: Discrete. 

(c) Domain: Points obtained on a single die can be 1, 2, 3, 4, 5, or 6. Hence the sum of points on a pair of 
dice can be 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, or 12, which is the domain of S. Variable: Discrete. 

(d) Domain: If we consider a point as a sphere of zero diameter, the domain of D is all values from zero 
upward. Variable: Continuous. 

(e) Domain: England, France, Germany, etc., which can be represented numerically by 1, 2, 3, etc. 
Variable: Discrete. 



ROUNDING OF DATA 



1.3 Round each of the following numbers to the indicated accuracy: 



(a) 


48.6 


nearest unit 


(./■) 


143.95 


nearest tenth 


(b) 


136.5 


nearest unit 


(g) 


368 


nearest hundred 


(c) 


2.484 


nearest hundredth 


(h) 


24,448 


nearest thousand 


{d) 


0.0435 


nearest thousandth 


(0 


5.56500 


nearest hundredth 


(e) 


4.50001 


nearest unit 


U) 


5.56501 


nearest hundredth 



SOLUTION 

(a) 49; (b) 136; (c) 2.48; (d) 0.044; (e) 5; (/) 144.0; (g) 400; (h) 24,000; (/) 5.56; (J) 5.57. 
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1.4 Add the numbers 4.35, 8.65, 2.95, 12.45, 6.65, 7.55, and 9.75 (a) directly, (b) by rounding to the 
nearest tenth according to the "even integer" convention, and (c) by rounding so as to increase 
the digit before the 5. 

SOLUTION 



(a) 4.35 (b) 4.4 (c) 4.4 

8.65 8.6 8.7 

2.95 3.0 3.0 

12.45 12.4 12.5 

6.65 6.6 6.7 

7.55 7.6 7.6 

9.75 9.8 9.8 
Total 52.35 Total 52.4 Total 52.7 



Note that procedure (b) is superior to procedure (c) because cumulative rounding errors are minimized in 
procedure (b). 



SCIENTIFIC NOTATION AND SIGNIFICANT FIGURES 

1.5 Express each of the following numbers without using powers of 10: 

(a) 4.823 x 10 7 (c) 3.80 x 10~ 4 (e) 300 x 10 8 

(b) 8.4 x 10~ 6 (d) 1.86 x 10 5 (/) 70,000 x 10~ 10 

SOLUTION 

(a) Move the decimal point seven places to the right and obtain 48,230,000; (b) move the decimal 
point six places to the left and obtain 0.0000084; (c) 0.000380; (d) 186,000; (e) 30,000,000,000; 
if) 0.0000070000. 

1.6 How many significant figures are in each of the following, assuming that the numbers are 
recorded accurately? 

(a) 149.8 in (d) 0.00280 m (g) 9 houses 

(b) 149.80 in (e) 1.00280 m (h) 4.0 x 10 3 pounds (lb) 

(c) 0.0028 meter (m) (/) 9 grams (g) (/) 7.58400 x 10~ 5 dyne 

SOLUTION 

(a) Four; (b) five; (c) two; (d) three; (e) six; (/) one; (g) unlimited; (h) two; (;) six. 

1.7 What is the maximum error in each of the following measurements, assuming that they are 
recorded accurately? 

(a) 73.854 in (b) 0.09800 cubic feet (ft 3 ) (c) 3.867 x 10 8 kilometers (km) 

SOLUTION 

(a) The measurement can range anywhere from 73.8535 to 73.8545 in; hence the maximum error is 
0.0005 in. Five significant figures are present. 

(b) The number of cubic feet can range anywhere from 0.097995 to 0.098005 ft 3 ; hence the maximum error 
is 0.000005 ft 3 . Four significant figures are present. 
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(c) The actual number of kilometers is greater than 3.8665 x 10 s but less than 3.8675 x 10 8 ; hence the 
maximum error is 0.0005 x 10 8 , or 50,000 km. Four significant figures are present. 

1.8 Write each number using the scientific notation. Unless otherwise indicated, assume that all 
figures are significant. 

(a) 24,380,000 (four significant figures) (c) 7,300,000,000 (five significant figures) 

(b) 0.000009851 (d) 0.00018400 

SOLUTION 

(a) 2.438 x 10 7 ; (b) 9.851 x 10~ 6 ; (c) 7.3000 x 10 9 ; (d) 1.8400 x 10~ 4 . 



COMPUTATIONS 

1.9 Show that the product of the numbers 5.74 and 3.8, assumed to have three and two significant 
figures, respectively, cannot be accurate to more than two significant figures. 

SOLUTION 

First method 

5.74 x 3.8 = 21.812, but not all figures in this product are significant. To determine how many figures 
are significant, observe that 5.74 stands for any number between 5.735 and 5.745, while 3.8 stands for any 
number between 3.75 and 3.85. Thus the smallest possible value of the product is 5.735 x 3.75 = 21.50625, 
and the largest possible value is 5.745 x 3.85 = 21.11825. 

Since the possible range of values is 21.50625 to 22.11825, it is clear that no more than the first 
two figures in the product can be significant, the result being written as 22. Note that the number 22 stands 
for any number between 21.5 and 22.5. 

Second method 

With doubtful figures in italic, the product can be computed as shown here: 
5.7 4 
3<5 
4592 
1122 
21.812 

We should keep no more than one doubtful figure in the answer, which is therefore 22 to two significant 
figures. Note that it is unnecessary to carry more significant figures than are present in the least accurate 
factor; thus if 5. 74 is rounded to 5.7, the product is 5.7 x 3.8 — 21.66 = 22 to two significant figures, agreeing 
with the above results. 

In calculating without the use of computers, labor can be saved by not keeping more than one or two 
figures beyond that of the least accurate factor and rounding to the proper number of significant figures in 
the final answer. With computers, which can supply many digits, we must be careful not to believe that all 
the digits are significant. 

1.10 Add the numbers 4.19355, 15.28, 5.9561, 12.3, and 8.472, assuming all figures to be significant. 
SOLUTION 

In calculation (a) below, the doubtful figures in the addition are in italic type. The final answer with no 
more than one doubtful figure is recorded as 46.2. 



12 



VARIABLES AND GRAPHS 



[CHAP. 1 



(a) 4.19355 (b) 4.19 

15.25 15.28 

5.956i 5.96 
12.3 12.3 

8.472 8.47 
46.20165 46.20 

Some labor can be saved by proceeding as in calculation (b), where we have kept one more significant 
decimal place than that in the least accurate number. The final answer, rounded to 46.2, agrees with 
calculation (a). 



Calculate 475,000,000+ 12,684,000 - 1,372,410 if these numbers have three, five, and seven 
significant figures, respectively. 



In calculation (a) below, all figures are kept and the final answer is rounded. In calculation (b), a method 
similar to that of Problem 1.10(6) is used. In both cases, doubtful figures are in italic type. 

{a) 475,000,000 487,684,000 {b) 415,000,000 487,700,000 

+ 12,684,000 - 1,372,410 + 12,700,000 - 1,400,000 

487,684,000 486,311,590 487,700,000 486,300,000 

The final result is rounded to 486,000,000; or better yet, to show that there are three significant figures, 
it is written as 486 million or 4.86 x 10 8 . 



1.12 Perform each of the indicated operations. 

,^ ,oa , x (1-47562 - 1.47322) (4895.36) 

(a) 48 - x 943 (e) ~ omnmk " 

(b) 8.35/98 

(c) (28)(4193)(182) (g) 3.1416V/7L35 



(526.7)(0.001280) — 
— (h) V128.5- 



SOLUTION 

(a) 48.0 x 943 = (48.0)(943) = 45,300 

(b) 8.35/98 = 0.085 

(c) (28)(4193)(182) = (2.8 x 10')(4.193 x 10 3 )(1.82 x 10 2 ) 

= (2.8)(4.193)(1.82) x 10 1+3+2 = 21 x 10 6 = 2.1 x 10 7 
This can also be written as 21 million to show the two significant figures. 
(526.7)(0.001280) _ (5.267 x 10 2 )(1.280 x IP' 3 ) _ (5.267)(1.280) (10 2 )(1Q- 3 ) 



id) 



0.000034921 3.4921 x 10" 5 3.4921 10" 5 

10 2-3 10~' 
= 1.931 x^^ 1.931 x^ 

= 1.931 x 10" 1+5 = 1.931 x 10 4 
his can also be written as 19.31 thousand to show the four significant figures. 
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_ (2.40)(4.89536) (irr 3 )(10 J ) 



= 7.38 x — -= 7.38 x 10 4 



1.59180 10- 4 ' 10- 4 

This can also be written as 73.8 thousand to show the three significant figures. Note that although six 
significant figures were originally present in all numbers, some of these were lost in subtracting 1.47322 
from 1.47562. 

(/) If denominators 5 and 6 are exact, + ( - 5A ^ = 3 . 84 + 5 . 09 = 8.85 

(g) 3.1416V71.35 = (3.1416)(8.447) = 26.54 



1.13 Evaluate each of the following, given that X = 3, Y — -5, A — 4, and B = -7, where all numbers 
are assumed to be exact: 

Y 2 - V 2 

(a) 2X-3Y <J) - 



[6A 2 2B 2 



+ 1 

(b) 47-8Z + 28 (g) V2X 2 - Y 2 - 3 A 2 + 4B 2 + 3 

AX + BY 

(C) BX^AY (h) 

(d) X 2 -3XY-2Y 2 

(e) 2(X + 3Y) -4(3X-2Y) 

SOLUTION 

(a) 2X-3F = 2(3)-3(-5) = 6+15 = 21 

(b) 4Y - 8X + 28 = 4(-5) - 8(3) + 28 = -20 - 24 + 28 = -16 
AX + BY = (4)(3) + (-7)(-5) = 12 + 35 = 47 = 

(C> BX-AY (-7)(3) - (4)(-5) -21+20 -1 

(d) X 2 - 3XY - 2Y 2 = (3) 2 - 3(3)(-5) - 2(-5) 2 = 9 + 45 - 50 = 4 

(e) 2(X + 37) - 4(3X - 2Y) = 2[(3) + 3(-5)] - 4[3(3) - 2(-5)] 

= 2(3 - 15) - 4(9 + 10) = 2(-12) - 4(19) = 

Another method 

2(X + 3Y) -4(3X-2F) = 2X + 67 - UX + 8 7 = -10X + 14F = 
= -30 - 70 = -100 



(4) 2 - (-7) 2 + 1 



16-49 + 



(g) V2X 2 - Y 2 - 3A 2 + 4B 2 + 3 = ^2(3) 2 - (-5) 2 - 3(4) 2 + 4(-7) 2 + 3 
= V18 - 25 -48 + 196 + 3 = Vl44 = 12 



/6^ 2 2B 2 _ k^f 2(~lf = /96~J 
VX 7 V 3 -5 V 3 - 



V 12.4 = 3.52 approximately 



FUNCTIONS AND GRAPHS 

1.14 Table 1 . 1 shows the number of bushels (bu) of wheat and corn produced on the Tyson farm during 
the years 2002, 2003, 2004, 2005, and 2006. With reference to this table, determine the year or years 
during which (a) the least number of bushels of wheat was produced, (b) the greatest number of 
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2002 
2003 
2004 
2005 
2006 



bushels of corn was produced, (c) the greatest decline in wheat production occurred, (d) equal 
amounts of wheat were produced, and (e) the combined production of wheat and corn was a 
maximum. 

SOLUTION 

(a) 2004; (b) 2006; (c) 2004; (d) 2002 and 2005; (e) 2006 



1.15 Let W and C denote, respectively, the number of bushels of wheat and corn produced during the 
year t on the Tyson farm of Problem 1.14. It is clear Wand C are both functions of t; this we can 
indicate by W=F{i) and C=G(t). 

(a) Find W when ? = 2004. 

(b) Find C when t = 2002. 

(c) Find t when W= 205. 

(d) Find F(2005). 

(e) Find G(2005). 
(/) Find C when W= 190. 
SOLUTION 

(a) 190 

(b) 80 

, i „ K | :o,i; 

(d) 205 

(c) 115 

(/) no 

(g) The years 2002 through 2006. 

(h) Yes, since to each value that t can assume, there corresponds one and only one value of W. 
(0 Yes, the functional dependence of t on IF can be written t = H(W). 

(j) Yes. 

( k) Physically, it is customary to think of W as determined from t rather than t as determined from W. 
Thus, physically, t is the dependent variable and W is the independent variable. Mathematically, however, 
either variable can be considered the independent variable and the other the dependent. The one that is 
assigned various values is the independent variable; the one that is then determined as a result is the 
dependent variable. 



(g) What is the domain of the variable f> 

(h) Is IF a single-valued function of fl 

(i) Is t a function of Wl 
(J) Is C a function of Wl 

(k) Which variable is independent, t or Wl 



1.16 A variable Y is determined from a variable X according to the equation Y = 2X — 3, where the 2 
and 3 are exact. 

(a) Find Y when X=3, -2, and 1.5. 

(b) Construct a table showing the values of Y corresponding to X — —2, —1,0, 1, 2, 3, and 4. 

(c) If the dependence of Y on X is denoted by Y = F(X), determine F(2A) and F(0.8). 

(d) What value of X corresponds to Y = 15? 

(e) Can X be expressed as a function of F? 
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(f) Is Y a single-valued function of XI 

(g) Is X a single-valued function of Yl 

SOLUTION 

(a) When X=3, F = 2X-3 = 2(3)-3 = 6-3 = 3. When X=-2, F = 2X-3 = 2(-2)-3 = -4-3 = -7. 

WhenX = 1.5, F = 2X-3 = 2(1.5)-3 = 3-3 = 0. 
(/>) The values of Y, computed as in part (a), are shown in Table 1.2. Note that by using other values of X, we 

can construct many tables. The relation Y = 2X - 3 is equivalent to the collection of all such possible 

tables. 

Table 1.2 

X -2 -1 12 3 4 
Y -7-5-3-1135 



(c) F(2.4) = 2(2.4) - 3 = 4.8 - 3 = 1.8, and F(0.8) = 2(0.8) - 3 = 1.6 - 3 = -1.4. 

(d) Substitute Y = 15 in Y = 2X — 3. This gives 15 = 2X - 3, 2Y = 18, and X = 9. 

(e) Yes. Since F = 2X — 3, F + 3 = 2X and X = \ ( F + 3). This expresses X explicitly as a function of F. 
(/) Yes, since for each value that X can assume (and there is an indefinite number of these values) there 

corresponds one and only one value of F. 
(g) Yes, since from part (e), X = \ ( F + 3), so that corresponding to each value assumed by F there is one 
and only one value of X. 

1.17 If Z = 16 + AX - 3F, find the value of Z corresponding to (a) X = 2, Y = 5; (b) X = -3, 

Y = -7; (c) Z = -4, 7 = 2. 

SOLUTION 

(a) Z= 16 + 4(2) -3(5) = 16 + 8-15 = 9 

(/>) Z= 16 + 4(-3)-3(-7) = 16- 12 + 21 =25 

(c) Z=16 + 4(-4)-3(2) = 16-16-6 = -6 

Given values of X and F, there corresponds a value of Z. We can denote this dependence of Z on X and 
F by writing Z = F(X, F) (read "Z is a function of X and F"). F(2, 5) denotes the value of Z when X = 2 
and F = 5 and is 9, from part (a). Similarly, F(-3, -7) = 25 and F(-A, 2) = -6 from parts (b) and (c), 
respectively. 

The variables X and F are called independent variables, and Z is called a dependent variable. 

1.18 The overhead cost for a company is $1000 per day and the cost of producing each item is $25. 

(a) Write a function that gives the total cost of producing x units per day. 

(b) Use EXCEL to produce a table that evaluates the total cost for producing 5, 10, 15, 20, 25, 30, 35, 40, 
45, and 50 units a day. 

(c) Evaluate and interpret /(100). 
SOLUTION 

(a) /<X)= 1000 + 25x. 

(b) The numbers 5, 10, ... , 50 are entered into B1:K1 and the expression =1000 + 25*B1 is entered into B2 
and a click-and-drag is performed from B2 to K2 to form the following output. 

x 5 10 15 20 25 30 35 40 45 50 

f(x) 1025 1050 1075 1100 1125 1150 1175 1200 1225 1250 

(c) /(100)= 1000 + 25(100)= 1000 + 2500 = 3500. The cost of manufacturing x= 100 units a day is 3500. 



16 



VARIABLES AND GRAPHS 



[CHAP. 1 



1.19 A rectangle has width x and length x+ 10. 

(a) Write a function, A(x), that expresses the area as a function of x. 

(b) Use EXCEL to produce a table that evaluates A(x) for x = 0, 1, . . ., 5. 

(c) Write a function, P(.v), that expresses the perimeter as a function of x. 

(d) Use EXCEL to produce a table that evaluates P(x) for x = 0, 1, . . . , 5. 

SOLUTION 

(a) A(x) = x{x+I0) = x 2 +I0x 

(b) The numbers 0, 1, 2, 3, 4, and 5 are entered into B1:G1 and the expression =BU2+10*B1 is entered 
into B2 and a click-and-drag is performed from B2 to G2 to form the output: 

X 1 2 3 4 5 

A(x) 11 24 39 56 75 

(c) P(x) = x + (x+ 10) + x + (x + 10) = Ax + 20. 

(a?) The numbers 0, 1, 2, 3, 4, and 5 are entered into B1:G1 and the expression =4*B1 +20 is entered into 
B2 and a click-and-drag is performed from B2 to G2 to form the output: 
X 1 2 3 4 5 

P(x) 20 24 28 32 36 40 



1.20 Locate on a rectangular coordinate system the points having coordinates (a) (5, 2), (b) (2, 5), 
(c) (-5, 1), (d) ((1, - 3), (e) (3, - 4), (J) (-2.5, - 4.8), (g) (0, - 2.5), and (h) (4, 0). Use MAPLE 
to plot the points. 

SOLUTION 

See Fig. 1-2. The MAPLE command to plot the eight points is given. Each point is represented by 
a circle. 

L: = [[5, 2], [2, 5], [-5, 1], [1, -3], [3, -4], [-2.5, -4.8], [0, -2.5], [4, 0]]; 
pointplot (L, font = [TIMES, BOLD, 14], symbol = cirlce) ; 



-3 -2-11) 1 2 3 



Fig. 1-2 A MAPLE plot of points. 
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1.21 Graph the equation Y = AX - 4 using MIN1TAB. 
SOLUTION 

Note that the graph will extend indefinitely in the positive direction and negative direction of the x axis. 
We arbitrarily decide to go from —5 to 5. Figure 1-3 shows the MINITAB plot of the line Y = 4X — 4. 
The pull down "Graph Scatterplots" is used to activate scatterplots. The points on the line are formed by 
entering the integers from -5 to 5 and using the MINITAB calculator to calculate the corresponding Y 
values. The Zand Y values are as follows. 

X -5 -4 -3 -2 -1 1 2 3 4 5 



The points are connected to get an idea of what the graph of the equation Y = 4X — 4 looks like. 




Fig. 1-3 A MINITAB graph of a linear fund! 



1.22 Graph the equation Y = 2X 2 - 3X - 9 using EXCEL. 
SOLUTION 



Table 1.3 EXCEL-Generated Values for a Quadratic Function 

X -5 -4 -3 -2 -1 1 2 3 4 5 



Y 56 35 18 5 -4 -9 -10 -7 11 26 



EXCEL is used to build Table 1 .3 that gives the T values for X values equal to —5, — 4, ... , 5. The expression 
=2*B1 A 2-3*Bl-9 is entered into cell B2 and a click-and-drag is performed from B2 to L2. The chart wizard, in 
EXCEL, is used to form the graph shown in Fig. 1-4. The function is called a quadratic function. The roots 
(points where the graph cuts the x axis) of this quadratic function are at X= 3 and one is between -2 and - 1 . 
Clicking the chart wizard in EXCEL shows the many different charts that can be formed using EXCEL. Notice 
that the graph for this quadratic function goes of to positive infinity as Zgets larger positive as well as larger 
negative. Also note that the graph has its lowest value when X is between and 1 . 



1.23 Table 1.4 shows the rise in new diabetes diagnoses from 1997 to 2005. Graph these data. 



Table 1.4 Number of New Diabetes Diagnc 



Year 
Millions 



1997 
O.HS 
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Fig. 1-4 An EXCEL plot of the curve called a parabola. 



SOLUTION 

First method 

The first plot is a time series plot and is shown in Fig. 1-5. It plots the number of new cases of diabetes for 
each year from 1997 to 2005. It shows that the number of new cases has increased each year over the time 
period covered. 

Second method 

Figure 1-6 is called a bar graph, bar chart, or bar diagram. The width of the bars, which are all equal, have no 
significance in this case and can be made any convenient size as long as the bars do not overlap. 




1997 1998 1999 2000 2001 2002 2003 2004 2005 
Year 

Fig. 1-5 MINITAB time series of new diagnoses of diabetes per year. 
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2000 2001 2002 2003 2004 2005 
Year 



Fig. 1-6 MINITAB bar graph of new diagnoses of diabetes per year. 



Third method 

A bar chart with bars running horizontal rather than vertical is shown in Fig. 1-7. 




Millions 

Fig. 1-7 MINITAB horizontal bar graph of new diagnoses of diabetes per year. 



1.24 Graph the data of Problem 1.14 by using a time series from MINITAB, a clustered bar with three 
dimensional (3-D) visual effects from EXCEL, and a stacked bar with 3-D visual effects from 
EXCEL. 



The solutions are given in Figs. 1-8, 1-9, and 1-10. 
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Variable 
— •— Bushels of wheat 
Bushels of corn 



2002 2003 
Fig. 1-8 MINITAB tin 



2004 
Year 



2005 2006 
for wheat and corn production (2002-2006). 



2002 2003 2004 2005 2006 
Year 

Fig. 1-9 EXCEL clustered bar with 3-D visual effects. 



1.25 (a) Express the yearly number of bushels of wheat and corn from Table 1.1 of Problem 1.14 as 
percentages of the total annual production. 
(b) Graph the percentages obtained in part (a). 



SOLUTION 

(a) For 2002, the percentage of wheat = 205/(205 + 80) = 71.9% and the percentage of corn = 
100% - 71.9% = 28.1%, etc. The percentages are shown in Table 1.5. 

(b) The 700% stacked column compares the percentage each value contributes to a total across categories 
(Fig. 1-11). 
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□ Bushels of corn 

□ Bushels of wheat 



Fig. 1-10 EXCEL stacked bar with 3-D visual effect. 
Table 1.5 Wheat and corn production from 2002 till 2006 



Year 


Wheat (%) 


Corn (%) 


2002 


71.9 


28.1 


2003 


67.2 


32.8 


2004 


63.3 


36.7 


2005 


64.1 


35.9 


2006 


65.2 


34.8 



□ Corn (%) 

□ Wheat (%) 



002 2003 2004 2005 2006 
Year 

Fig. 1-11 EXCEL 100% stacked column. 
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1.26 In a recent issue of USA Today, in a snapshot entitled "Perils online", a survey of 1500 kids of 
ages 10-17 was reported. Display as a clustered column bar chart and a stacked column bar chart 
the information in Table 1.6. 



Table 1.6 




Solicitation 


Exposure to Pornography 


Harassment 


2000 


19% 


25% 


6% 


2005 


13% 


34% 


9% 



SOLUTION 

Figure 1-12 shows a clustered column bar chart and Fig. 1-13 shows a stacked column bar chart of the 
information. 









rn 



Solicitation Exposure to Harassment 

pornography 

Fig. 1-12 EXCEL clustered column bar chart. 



70 



50 




Solicitation Exposure to Harassment 

pornography 

Fig. 1-13 EXCEL stacked column bar chart. 



CHAP. 1] 



VARIABLES AND GRAPHS 



23 



1.27 In a recent USA Today snapshot entitled "Where the undergraduates are", it was reported that 
more than 17.5 million undergraduates attend more than 6400 schools in the USA. Table 1.7 gives 
the enrollment by type of school. 

Table 1.7 Where the undergraduates are 



Type of School 


Percent 


Public 2-year 


43 


Public 4-year 


32 


Private non-profit 4-year 


15 


Private 2- and 4-year 


6 


Private less than 4-year 


3 


Other 


1 



Construct a 3-D bar chart of the information in Table 1.7 using EXCEL and a bar chart using 
MINITAB. 

SOLUTION 

Figures 1-14 and 1-15 give 3-D bar charts of the data in Table 1.6. 




Fig. 1-14 EXCEL 3-D Bar chart of Table 1.7 information. 




Type of School 



Fig. 1-15 MINITAB bar chart of Table 1.7 information. 
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1.28 Americans have an average of 2.8 working televisions at home. Table 1.8 gives the breakdown. 
Use EXCEL to display the information in Table 1.8 with a pie chart. 

Table 1.8 Televisions per household 

Televisions Percent 

None 2 

One 15 

Two 29 

Three 26 

Five + 12 

SOLUTION 

Figure 1-16 gives the EXCEL pie chart for the information in Table 1.8. 




EQUATIONS 

1.29 Solve each of the following equations: 

(a) 4a -20 = 8 (c) 18 - 5b = 3(b + 8) + 10 

(b) 3X + 4-24-2X (d) — ^+ 1= f 

SOLUTION 

(a) Add 20 to both members: 4a - 20 + 20 = 8 + 20, or 4a = 28. 
Divide both sides by 4: 4a/4 = 28/4, and a = 7. 

Check: 4(7) - 20 = 8, 28 - 20 = 8, and 8 = 8. 

(b) Subtract 4 from both members: 3X + 4 - 4 = 24 - 2X - 4, or 3X = 20 - 2X. 
Add 2X to both sides: 3X + 2X = 20 - 2X + 2X, or 5X = 20. 

Divide both sides by 5: 5Z/5 = 20/5, and X = 4. 

Check: 3(4) + 4 = 24 - 2(4), 12 + 4 = 24 - 8, and 16 = 16. 

The result can be obtained much more quickly by realizing that any term can be moved, or 
transposed, from one member of an equation to the other simply by changing its sign. Thus we can write 
3X + 4 = 24-2X 3Z + 2Z = 24-4 5X = 20 X = 4 
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(c) 18 - 5b = 3b + 24 + 10, and 18 - 5b = 3b + 34. 

Transposing, -5b - 3b = 34 - 18, or -Sb = 16. 

Dividing by -8, -86/(-8) = 16/(— 8), and b = -2. 

Check: 18 - 5(-2) = 3(-2 + 8) + 10, 18 + 10 = 3(6) + 10, and 28 = 28. 
(a") First multiply both members by 6, the lowest common denominator. 

6(^+1) =6(|) 6(^)+6(lHf 2(7 + 2) + 6 = 37 

27+4+6=37 27+10 = 37 10 = 37-27 7= 10 

C/^.- 1^+1=^+1=^,4+1 = 5, and5 = 5. 

1.30 Solve each of the following sets of simultaneous equations: 

(a) 3a -26 =11 (6) 5X+14F=78 (c) 3a + 26 + 5c =15 

5a + 76 = 39 7X + 37=-7 7a - 3b + 2c = 52 

5a + 6 - 4c = 2 



SOLUTION 

(a) Multiply the first equation by 7: 21a - 146 = 77 (7) 

Multiply the second equation by 2: 10a+146= 78 (2) 

Add: 31a = 155 

Divide by 31: a=5 

Note that by multiplying each of the given equations by suitable numbers, we are able to write 
two equivalent equations, (1) and (2), in which the coefficients of the unknown b are numerically equal. 
Then by addition we are able to eliminate the unknown b and thus find a. 

Substitute a = 5 in the first equation: 3(5) -2b=\\, -2b = -4, and b = 2. Thus a = 5, and 6 = 2. 
C6cc£: 3(5) - 2(2) = 1 1, 15 - 4 = 1 1, and 1 1 = 1 1; 5(5) + 7(2) = 39, 25 + 14 = 39, and 39 = 39. 

(6) Multiply the first equation by 3: 15Z + 427 = 234 (3) 

Mutiply the second equation by -14: -98X-427= 98 (4) 

Add: -83X = 332 

Divide by -83: X = -4 

Substitute X = -4 in the first equation: 5(-4) + 147 = 78, 147 = 98, and 7 = 7. 
Thus X = -4, and 7 = 7. 

Check: 5(-4) + 14(7) = 78, -20 + 98 = 78, and 78 = 78; 7(-4) + 3(7) = -7, -28 + 21 = -7, and 
-7 = -7. 

(c) Multiply the first equation by 2: 6a + 4b + 10c = 30 

Multiply the second equation by -5: -35a + 15b - 10c = -260 

Add: -29a + 19/) =-230 (5) 

Multiply the second equation by 2: 14a - 6h+ 4c = 104 

Repeat the third equation: 5a + h - 4c = 2 

Add: 19a- 5b = 106 (6) 

We have thus eliminated c and are left with two equations, (5) and (<5), to be solved simultaneously 
for a and b. 

Multiply equation (5) by 5: - 145a + 95b = - 1 1 50 

Multiply equation (6) by 19: 361a -956= 2014 

Add: 216a = 864 

Divide by 216: a = 4 

Substituting a = 4 in equation (5) or (6), we find that b = -6. 

Substituting a = 4 and b = -6 in any of the given equations, we obtain c = 3. 
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Thus a = 4, b = -6, and c = 3. 

Check: 3(4) + 2(-6) + 5(3) = 15, and 15 = 15; 7(4) - 3(-6) + 2(3) = 52, and 52 = 52; 5(4) + (-6) 
-4(3) = 2, and 2 = 2. 

INEQUALITIES 

1.31 Express in words the meaning of each of the following: 

(a) N>30 (b) X<12 (c) < p < 1 (d) /i - 2t < X < p + 2t 

SOLUTION 

(a) N is greater than 30. 

(b) X is less than or equal to 12. 

(c) p is greater than but less than or equal to 1. 

(d) X is greater than /i - It but less than /i + It. 

1.32 Translate the following into symbols: 

(a) The variable X has values between 2 and 5 inclusive. 

(b) The arithmetic mean X is greater than 28.42 but less than 31.56. 

(c) m is a positive number less than or equal to 10. 

(d) P is a nonnegative number. 

SOLUTION 

(a) 2 < X < 5; (b) 28.42 < X < 31.56; (c) < m < 10; (d) P > 0. 

1.33 Using inequality symbols, arrange the numbers 3.42, -0.6, -2.1, 1.45, and -3 in (a) increasing 
and (b) decreasing order of magnitude. 

SOLUTION 

(a) -3 < -2.1 < -0.6 < 1.45 < 3.42 

(b) 3.42 > 1.45 > -0.6 > -2.1 > -3 

Note that when the numbers are plotted as points on a line (see Problem 1.18), they increase from left to 
right. 

1.34 In each of the following, find a corresponding inequality for X (i.e., solve each inequality for X): 



SOLUTION 

(a) Divide both sides by 2 to obtain X < 3. 

(b) Adding 8 to both sides, 3X > 12; dividing both sides by 3, X > 4. 

(c) Adding -6 to both sides, -4X < -8; dividing both sides by -4, X > 2. Note that, as in equations, we 
can transpose a term from one side of an inequality to the other simply by changing the sign of the term; 
from part (b), for example, 3X > 8 + 4. 

(d) Multiplying by 2, -6 < X - 5 < 6; adding 5, - 1< X < 1 1 . 

(e) Multiplying by 5, -5 < 3 - 2X < 35; adding -3, -8 < -2X < 32; dividing by -2, 4 > X > -16, or 
-16<X<4. 



(a) 2X < 6 



(c) 6 - AX < -2 




(b) 3X - 8 > 4 



<3 
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LOGARITHMS AND PROPERTIES OF LOGARITHMS 

1.35 Use the definition of y = \og h x to find the following logarithms and then use EXCEL to verify 
your answers. (Note that y = log/,* means that b y = x.) 

(a) Find the log to the base 2 of 32. 

(b) Find the log to the base 4 of 64. 

(c) Find the log to the base 6 of 216. 

(d) Find the log to the base 8 of 4096. 

(e) Find the log to the base 10 of 10,000. 

SOLUTION 

(a) 5; (b) 3; (c) 3; (d) 4; (e) 4. 

The EXCEL expression =LOG(32,2) gives 5, =LOG(64,4) gives 3, =LOG(216,6) gives 3, =LOG(4096,8) 
gives 4, and =LOG( 10000, 10) gives 4. 

1.36 Using the properties of logarithms, re-express the following as sums and differences of logarithms. 

» '#) m 10 #) 

Using the properties of logarithms, re-express the following as a single logarithm. 

(c) ln(5) + ln(10)-21n(5) (d) 2 log(5) - 3 log(5) + 5 log(5) 

SOLUTION 

(a) 2 ln(x) + 3 ln(y) + ln(z) - ln(a) - ln(6) 

(ft) 2 log(a) + 3 log(A) + log(c) - log(v) - log(z) 

(c) ln(2) 

(d) log(625) 

1.37 Make a plot of >• = ln(x) using SAS and SPSS. 
SOLUTION 

The solutions are shown in Figures 1-17 and 1-18. 
3.00 - 

2.00 - 

1.00- 
c 0.00 - 
-1 .00 - 
-2.00 - 
-3.00 - 

0.00 2.00 4.00 6.00 8.00 10.00 12.00 14.00 
Fig. 1-17 SPSS plot of y = ln(jt). 
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4 6 8 10 

Fig. 1-18 SAS plot of r = ln(jc). 



A plot of the curve j = ln(x) is shown in Figures 1-17 and 1-18. As x gets close to the values 
for ln(x) get closer and closer to -co. As x gets larger and larger, the values for ln(x) get closer 



LOGARITHMIC EQUATIONS 

1.38 Solve the logarithmic equation ln(x)= 10. 
SOLUTION 

Using the definition of logarithm, x = e 10 = 22026.47. As a check, we take the natural log of 22026.47 and get 
10.00000019. 



1.39 Solve the logarithmic equation log(x + 2) + \og(x - 2) = log(5) 
SOLUTION 

The left-hand side may be written as log[(x + 2)(x - 2)]. We have the equation log(.v + 2)(x - 2) = log(5) 
from which (x + 2)(x — 2) = (5). From which follows the equation x 2 — 4 = 5 or x 2 = 9 or x = —3 or 3. 
When these values are checked in the original equation, x = —3 must be discarded as a solution because 
the log is not defined for negative values. When x=3 is checked, we have log(5) + log(l) = log(5) since 
log(l) = 0. 



1.40 Solve the logarithmic equation log(a + 4) - log(a - 2) = 1. 
SOLUTION 

The equation may be rewritten as log((«+ 4)/(a — 2)) = 1. Applying the definition of a logarithm, we 
have (« + 4)/(a-2) = lO'or a + 4= 10a -20. Solving for a, a = 24/9 = 2.6 (with the 6 repeated). 
Checking the value 2.6667 into the original equation, we have 0.8239 - (-0.1761) = 1. Fhe only solu- 
tion is 2.6667. 
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1.41 Solve the logarithmic equation ln(x) 2 -1=0. 
SOLUTION 

The equation may be factored as [ln(x) + l][ln(x) - 1] = 0. Setting the factor ln(x) + 1=0, we have 
Iii(a-) = -1 or x = e _1 = 0.3678. Setting the factor ln(x) - 1 = 0, we have ln(x) = 1 or x = e 1 = 2.7183. 
Both values are solutions to the equation. 



1.42 Solve the following equation for x: 21og(x + 1) — 3 log(x + 1) = 2. 
SOLUTION 

The equation may be rewritten as log[(x + l) 2 /(x + l) 3 ] = 2 or log[l/(x + 1)] = 2 or log(l) - log(x + 1) = 2 
or - log(x + 1) = 2 or log(x + 1) = -2 or x + 1 = 10~ 2 or x = -0.99. Substituting into the original 
equation, we find 21og(0.01) - 31og(0.01) = 2. Therefore it checks. 



1.43 The software package MAPLE may be used to solve logarithmic equations that are not easy to 
solve by hand. Solve the following equation using MAPLE: 

log(x + 2)-ln(x 2 )=4 
SOLUTION 

The MAPLE command to solve the equation is "solve(log 10(x + 2) 
tion is given as —0.154594. 

Note that MAPLE uses loglO for the common logarithm. 

To check and see if the solution is correct, we have log( 1.845406) - 
4.00001059. 



- ln(x"2) = 4);" The solu- 

- ln(0.023899) which equals 



1.44 EXCEL may be used to solve logarithmic equations also. Solve the following logarithmic equa- 
tion using EXCEL: log(x + 4) + ln(x + 5) = 1 . 

SOLUTION 

Figure 1.19 gives the EXCEL worksheet. 



-3 -0.30685 LOG10(A1+4)+LN(A1+5)- 1 

-2 0.399642 

-1 0.863416 

1.211498 

1 1 .490729 

2 1.724061 

3 1 .92454 

4 2.100315 

5 2.256828 



-3 -0.30685 LOG10(A11+4)+LN(A11+5)- 1 

-2.9 -0.21667 

-2.8 -0.13236 

-2.7 -0.05315 

-2.6 0.021597 

-2.5 0.092382 

-2.4 0.159631 

-2.3 0.223701 

-2.2 0.284892 

-2.1 0.343464 

-2 0.399642 



Fig. 1-19 EXCEL worksheet for Problem 1.44. 
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The iterative technique shown above may be used. The upper half locates the root of 
log(x + 4) + ln(x + 5) - 1 between -3 and -2. The lower half locates the root between 
-2.7 and -2.6. This process may be continued to give the root to any desired accuracy. Click- 
and-drag techniques may be used in doing the technique. 

1.45 Solve the equation in Problem 1.44 using the software package MAPLE. 
SOLUTION 

The MAPLE command "> solve(loglO(x + 4) + ln(x + 5) = 1);" gives the solution -2.62947285. Compare 
this with that obtained in Problem 1.44. 



Supplementary Problems 



State which of the following represent discn 

(a) Number of inches of rainfall in a city during variou: 

(b) Speed of an automobile in miles per hour 

(c) Number of $20 bills circulating in the United States 

(d) Total value of shares sold each day in the stock market 

(e) Student enrollment in a university over a number of years 



nd which represent 
rious months of the year 



Give the domain of each of the following variables and state whether the variables are continuous or 
discrete: 



(a) Number W of bushels of wheat produced 
per acre on a farm over a number of years 

(b) Number N of individuals in a family 



(c) Marital status of an individual 

(d) Time T of flight of a missile 

(e) Number P of petals on a flower 



ROUNDING OF DATA, SCIENTIFIC NOTATION, AND SIGNIFICANT FIGURES 

1.48 Round each of the following numbers to the indicated accuracy: 



(a) 


3256 


nearest hundred 


(/) 3,502,378 nearest million 


(b) 


5.781 


nearest tenth 


(g) 148.475 nearest unit 


(c) 


0.0045 


nearest thousandth 


(h) 0.000098501 nearest millionth 


id) 


46.7385 


nearest hundredth 


(0 2184.73 nearest ten 


(e) 


125.9995 


two decimal places 


(J) 43.87500 nearest hundredth 


Exp 


ress each : 


number without using powers of 10. 


(a) 


132.5 x 10 4 (c) 280 x 10"' 


' (e) 3.487 x 10~ 4 


ib) 


418.72 x 


10~ 5 (d) 7300 x 10 f 


(/) 0.0001850 x 10 5 
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1.50 How many significant figures are in each of the following, assuming that the numbers are accurately 
recorded? 

(a) 2.54 cm (d) 3.51 million bu (/) 378 people (h) 4.50 x 10" 3 km 

(b) 0.004500 yd (e) 10.000100 ft (g) 378 oz (i) 500.8 x 10 5 kg 

(c) 3,510,000 bu O) lOO.OOmi 

1.51 What is the maximum error in each of the following measurements, assumed to be accurately recorded? Give 
the number of significant figures in each case. 

(a) 7.20 million bu (c) 5280 ft (e) 186,000 mi/sec 

(b) 0.00004835 cm (d) 3.0 x 10 8 m (/) 186 thousand mi/sec 

1.52 Write each of the following numbers using the scientific notation. Assume that all figures are significant 
unless otherwise indicated. 

(a) 0.000317 (d) 0.000009810 

(b) 428,000,000 (four significant figures) (e) 732 thousand 

(c) 21,600.00 (/) 18.0 ten-thousandths 



COMPUTATIONS 

1.53 Show that (a) the product and (b) the quotient of the numbers 72.48 and 5.16, assumed to have four and 
three significant figures, respectively, cannot be accurate to more than three significant figures. Write the 
accurately recorded product and quotient. 

1.54 Perform each indicated operation. Unless otherwise specified, assume that the numbers are accurately 
recorded. 

(a) 0.36 x 781.4 (g) 14.8641 +4.48 - 8.168 + 0.36125 

... 873.00 



(c) 5.78 x 2700 x 16.00 
, „ 0.00480 x 2300 



(h) 4, 1 73,000 - 1 70,264 + 1 ,820,470 - 78,320 

(numbers are respectively accurate to four, six, six, 
and five significant figures) 



(e) V120 x 0.5386 x 0.4614 (120 exa ( 
(416,000)(0.000187) 
VT3M 

1.55 Evaluate each of the following, given that U = -2, V = \, W = 3, X = -A, Y = 9, and Z = \, where all 
numbers are assumed to be exact. 



W * V + W-2W fe, f^ + ^ 

UVW \/(F-4) 2 + ([/ + 5) 2 

M ?ri~ 3 ™ © X 3 + 5X 2 -6X-8 
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(d) 3(u-xy+Y 



= [U 2 V{W + X)\ 



(e) W 2 -2UV+ W 

(f) 3X(4F + 3Z)-2F(6X-5Z)-25 



FUNCTIONS, TABLES, AND GRAPHS 

1.56 A variable F is determined from a variable X according to the equation Y = 10 — AX. 

(a) Find Y when X = -3, -2, -1, 0, 1, 2, 3, 4, and 5, and show the results in a table. 

(b) Find Y when X = -2.4, -1.6, -0.8, 1.8, 2.7, 3.5, and 4.6. 

(c) If the dependence of Y on X is denoted by Y = F(X), find F(2.8), F(-5), F(x/2), and F(-tt). 

(d) What value of X corresponds to Y — -2, 6, -10, 1.6, 16, 0, and 10? 

(e) Express X explicitly as a function of Y. 

1.57 If Z = X 2 - F 2 , find Z when (a) X = -2, F = 3, and (b) X = 1, F = 5. (c) If using the functional notation 
Z = F(Z, F), find F(-3, -1). 

1.58 If W = 3XZ - 4Y 2 + 2XY, find W when (a) X =\, F = -2, Z = 4, and (b) X = -5, F = -2, Z = 0. 
(c) If using the functional notation W = F(X, F, Z), find F(3, 1, -2). 

1.59 Locate on a rectangular coordinate system the points having coordinates (a) (3, 2). (b) (2. 3), (c) (-4, 4), 
(rf) (4, -4), (e) (-3, -2), (/) (-2, -3), fe) (-4.5, 3), (h) (-1.2, -2.4), (0 (0, -3), and (J) (1-8, 0). 

1.60 Graph the equations (a) Y = W - 4X (see Problem 1.56), (b) F = 2X+5, (c) F = ±LT - 6), 
{d) 2X+3Y= 12, and (e) 3X-2Y = 6. 

1.61 Graph the equations (a) F = 2X 2 + X - 10, and (b) Y = 6-3X- X 2 . 

1.62 Graph Y = X 3 - 4X 2 + \2X - 6. 

1.63 Table 1 .9 gives the number of health clubs and the number of members in millions for the years 2000 through 
2005. Use software to construct a time-series plot for clubs and construct a separate time-series plot for 
members. 



Clubs 
Members 



13000 
32.5 



13225 
35.0 



15000 
36.5 



20000 
39.0 



25500 
41.0 



28500 
41.3 



1.64 Using the data of Table 1 .9 and software, construct bar charts for clubs and for members. 

1.65 Using the chart wizard of EXCEL, construct a scatter plot for the clubs as well as the members in Table 1 .9. 



1.66 Table 1.10 shows the infant deaths per 1000 live births for whites and non-whites in a Midwestern state for 
the years 2000 through 2005. Use an appropriate graph to portray the data. 
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White 
Non-white 



1.67 Table 1.11 shows the orbital velocities of the planets in our solar system. Graph the data. 

Table 1.11 



Planet 


Mercury 




Earth 






Saturn 


Uranus 


Neptune 


Pluto 


Velocity 
(mi/sec) 


29.7 


21.8 


18.5 


15.0 


8.1 


6.0 


4.2 


3.4 


3.0 



1.68 Table 1.12 gives the projected public school enrollments (in thousands) for K through grade 8, grades 9 
through 12, and college for the years 2000 through 2006. Graph the data, using line graphs, bar graphs, and 
component bar graphs. 

1.69 Graph the data of Table 1.12 by using a percentage component graph. 



Table 1.12 



Year 


2000 


2001 


2002 


2003 


2004 


2005 


2006 


K— grade 8 


33,852 


34,029 


34,098 


34,065 


33,882 


33,680 


33,507 


Grades 9-12 


13,804 


13,862 


14,004 


14,169 


14,483 


14,818 


15,021 


College 


12,091 


12,225 


12,319 


12,420 


12,531 


12,646 


12,768 



Source: U.S. National Center for Educational Statistics and Projections of Education Statistics, annual. 



1.70 Table 1.13 shows the marital status of males and females (18 years and older) in the United States as of the 
year 1995. Graph the data, using (a) two pie charts having the same diameter and (b) a graph of your own 
choosing. 



Table 1.13 





Male 


Female 


Marital Status 


(percent of total) 


(percent of total) 


Never married 


26.8 


19.4 


Married 


62.7 


59.2 


Widowed 


2.5 


11.1 


Divorced 


8.0 


10.3 



Source: U.S. Bureau of Census — Current Population Reports. 



1.71 Table 1.14 gives the number of inmates younger than 18 held in state prisons during the years 2001 through 
2005. Graph the data using the appropriate types of graphs. 



Table 1.14 



Year 


2001 


2002 


2003 


2004 


2005 


number 


3,147 


3,038 


2,741 


2,485 


2,266 
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Table 1.15 gives the total number of visits to the Smithsonian Institution in millions during the years 2001 
through 2005. Construct a bar chart for the data. 



1.73 To the nearest million, Table 1.16 shows the seven countries of the world with the largest populations as of 
1997. Use a pie chart to illustrate the populations of the seven countries of the world with the largest 
populations. 



Table 1.16 



Country 


United 

China India States Indonesia Brazil Russia Pakistan 


Population 
(millions) 


1,222 968 268 210 165 148 132 



Source: U.S. Bureau of the Census, International database. 



1.74 A Pareto chart is a bar graph in which the bars are arranged according to the frequency values so that the 
tallest bar is at the left and the smallest bar is at the right. Construct a Pareto chart for the data in Table 1.16. 

1.75 Table 1.17 shows the areas of the oceans of the world in millions of square miles. Graph the data, using 
(a) a bar chart and (b) a pie chart. 



Table 1.17 



Ocean 


Pacific Atlantic Indian Antarctic Arctic 


Area (million 
square miles) 


63.8 31.5 28.4 7.6 4.8 



Source: United Nations. 



EQUATIONS 

1.76 Solve each of the following equations: 

(a) 16 -5c =36 (c) 4{X - 3) - 11 = 15 - 2{X + 4) (e) 3[2(X + 1) - 4] = 10 - 5(4 - 2X) 

(b) 27-6 = 4-37 (d) 3(2U + 1) = 5(3 - U) + 3(U - 2) (/) §(12 + Y) = 6 - \(9 - Y) 

1.77 Solve each of the following simultaneous equations: 
[a) 2a + b = 10 (e) 2a + b - c = 2 

la - 3b = 9 3a - 4b + 2c = 4 

b) 3a+5b = 24 Aa + 3b- 5c = -8 

2a + 3b = 14 (/) 5X + 2Y + 3Z = -5 
8X-3F = 2 2X-3F-6Z=1 
3X+7Y = -9 X + 5Y-4Z = 22 

'd) 5A - 9B =-10 (g) 3U-5V + 6W=1 
3.4-45=16 5U+3V-2W = -l 

4U- 8F+ 101^= 11 
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1.78 (a) Graph the equations 5X + 2Y = 4 and IX - 3F = 23, using the same set of coordinate axes. 

(b) From the graphs determine the simultaneous solution of the two equations. 

(c) Use the method of parts (a) and (b) to obtain the simultaneous solutions of equations (a) to (d) of 
Problem 1.77. 

1.79 (a) Use the graph of Problem 1.61(a) to solve the equation 2X 2 + X - 10 = 0. (Hint: Find the values of X 

where the parabola intersects the X axis: that is, where Y = 0.) 
(b) Use the method in part (a) to solve 3X 2 - 4X - 5 = 0. 

1.80 The solutions of the quadratic equation aX 2 + bX + c = are given by the quadratic formula: 

_ -b ± Vb 2 - 4ac 
X ~ 2a 

Use this formula to find the solutions of (a) 3X 2 - 4X - 5 = 0, (b) 2X 2 + X - 10 = 0, (c) 5X 2 + 10X = 7, 
and (d) X 2 + 8Z + 25 = 0. 

INEQUALITIES 

1.81 Using inequality symbols, arrange the numbers -4.3, -6.15, 2.37, 1.52, and -1.5 in (a) increasing and 
(b) decreasing order of magnitude. 

1.82 Use inequality symbols to express each of the following statements: 

(a) The number N of children is between 30 and 50 inclusive. 

(b) The sum S of points on the pair of dice is not less than 7. 

(c) X is greater than or equal to -4 but less than 3. 

(d) P is at most 5. 

(e) X exceeds Y by more than 2. 

1.83 Solve each of the following inequalities: 

(a) 3X>12 (d) 3 + 5(F — 2) <7 — 3(4— Y) (g) -2 < 3 + \ {a - 12) < 8 

(b) 4Z<5X-3 (e) -3<i(2Z+l)<3 

(c) 2N+ 15 > 10 + 3JV (/) <i(15-5JV) < 12 

LOGARITHMS AND PROPERTIES OF LOGARITHMS 

1.84 Find the common logarithms: 

(a) log(10) (b) log(100) (c) log(1000) (d) log(O.l) (d) log(0.01) 

1.85 Find the natural logarithms of the following numbers to four decimal places: 

(a) ln(e) (b) ln(10) (c) ln(100) (d) ln(1000) (e) ln(0.1) 



1.86 Find the logarithms: 

(a) log 4 4 (b) log 5 25 (c) log 6 216 (d) log 7 2401 (e) log 8 32768 
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1.87 Use EXCEL to find the following logarithms. Give the answers and the commands. 
(a) log 4 5 (b) log 5 24 ( C ) log 6 215 (d) log 7 8 (e) log 8 9 

1.88 Use MAPLE to Problem 1.87. Give the answers and the MAPLE commands. 

1.89 Use the properties of logarithms to write the following as sums and differences of logarithms: ln((a 3 6 4 )/c 5 ) 

1.90 Use the properties of logarithms to write the following as sums and differences of logarithms: log((.vi-)/ir) 

1.91 Replace the following by an equivalent logarithm of a single expression: 5\n(a) - 41n(/;) + ln(c) + ln(d) 

1.92 Replace the following by an equivalent logarithm of a single expression: log(w) + log(v) + log(w)— 
21og(x)-31og(j/)-41og(z) 

LOGARITHMIC EQUATIONS 

1.93 Solve log(3x - 4) = 2 

1.94 Solve ln(3x 2 - x) = ln( 10) 

1.95 Solve log(w - 2) - log(2w + 7) = log(w + 2) 

1.96 Solve ln(3x + 5) + ln(2x - 5) = 12 

1.97 Use MAPLE or EXCEL to solve ln(2x) + log(3x - 1) = 10 

1.98 Use MAPLE or EXCEL to solve log(2x) + ln(3x - 1) = 10 

1.99 Use MAPLE or EXCEL to solve ln(3x) - log(x) = log 2 3 

1.100 Use MAPLE or EXCEL to solve log 2 (3x) - log(x) = ln(3) 




Frequency 
Distributions 



RAW DATA 

Raw data are collected data that have not been organized numerically. An example is the set of 
heights of 100 male students obtained from an alphabetical listing of university records. 



ARRAYS 

An array is an arrangement of raw numerical data in ascending or descending order of magnitude. 
The difference between the largest and smallest numbers is called the range of the data. For example, 
if the largest height of 100 male students is 74 inches (in) and the smallest height is 60 in, the range 
is 74 - 60 = 14 in. 



FREQUENCY DISTRIBUTIONS 

When summarizing large masses of raw data, it is often useful to distribute the data into classes, or 
categories, and to determine the number of individuals belonging to each class, called the class frequency. 
A tabular arrangement of data by classes together with the corresponding class frequencies is called a 
frequency distribution, or frequency table. Table 2.1 is a frequency distribution of heights (recorded to the 
nearest inch) of 100 male students at XYZ University. 



Table 2.1 Heights of 100 male students at 
XYZ University 



Height 


Number of 


(in) 


Students 


60-62 


5 


63-65 


18 


66-68 


42 


69-71 


27 


72-74 


8 




Total 100 
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The first class (or category), for example, consists of heights from 60 to 62 in and is indicated by the 
range symbol 60-62. Since five students have heights belonging to this class, the corresponding class 
frequency is 5. 

Data organized and summarized as in the above frequency distribution are often called grouped data. 
Although the grouping process generally destroys much of the original detail of the data, an important 
advantage is gained in the clear "overall" picture that is obtained and in the vital relationships that are 
thereby made evident. 

CLASS INTERVALS AND CLASS LIMITS 

A symbol defining a class, such as 60-62 in Table 2.1, is called a class interval. The end numbers, 60 
and 62, are called class limits; the smaller number (60) is the lower class limit, and the larger number (62) 
is the upper class limit. The terms class and class interval are often used interchangeably, although the 
class interval is actually a symbol for the class. 

A class interval that, at least theoretically, has either no upper class limit or no lower class limit 
indicated is called an open class interval. For example, referring to age groups of individuals, the class 
interval "65 years and over" is an open class interval. 

CLASS BOUNDARIES 

If heights are recorded to the nearest inch, the class interval 60-62 theoretically includes all mea- 
surements from 59.5000 to 62.5000 in. These numbers, indicated briefly by the exact numbers 59.5 and 
62.5, are called class boundaries, or true class limits: the smaller number (59.5) is the lower class boundary. 
and the larger number (62.5) is the upper class boundary. 

In practice, the class boundaries are obtained by adding the upper limit of one class interval to the 
lower limit of the next-higher class interval and dividing by 2. 

Sometimes, class boundaries are used to symbolize classes. For example, the various classes in the first 
column of Table 2.1 could be indicated by 59.5-62.5, 62.5-65.5, etc. To avoid ambiguity in using such 
notation, class boundaries should not coincide with actual observations. Thus if an observation were 
62.5, it would not be possible to decide whether it belonged to the class interval 59.5-62.5 or 62.5-65.5. 

THE SIZE, OR WIDTH, OF A CLASS INTERVAL 

The size, or width, of a class interval is the difference between the lower and upper class boundaries 
and is also referred to as the class width, class size, or class length. If all class intervals of a frequency 
distribution have equal widths, this common width is denoted by c. In such case c is equal to the 
difference between two successive lower class limits or two successive upper class limits. For the data 
of Table 2.1, for example, the class interval is c — 62.5 - 59.5 = 65.5 - 62.5 = 3. 

THE CLASS MARK 

The class mark is the midpoint of the class interval and is obtained by adding the lower and upper 
class limits and dividing by 2. Thus the class mark of the interval 60-62 is (60 + 62)/2 = 61. The class 
mark is also called the class midpoint. 

For purposes of further mathematical analysis, all observations belonging to a given class interval 
are assumed to coincide with the class mark. Thus all heights in the class interval 60-62 in are considered 
to be 61 in. 

GENERAL RULES FOR FORMING FREQUENCY DISTRIBUTIONS 

1 . Determine the largest and smallest numbers in the raw data and thus find the range (the difference 
between the largest and smallest numbers). 
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2. Divide the range into a convenient number of class intervals having the same size. If this is not 
feasible, use class intervals of different sizes or open class intervals (see Problem 2.12). The number 
of class intervals is usually between 5 and 20, depending on the data. Class intervals are also chosen 
so that the class marks (or midpoints) coincide with the actually observed data. This tends to lessen 
the so-called grouping error involved in further mathematical analysis. However, the class bound- 
aries should not coincide with the actually observed data. 

3. Determine the number of observations falling into each class interval; that is, find the class frequen- 
cies. This is best done by using a tally, or score sheet (see Problem 2.8). 



HISTOGRAMS AND FREQUENCY POLYGONS 

Histograms and frequency polygons are two graphic representations of frequency distributions. 

1. A histogram or frequency histogram, consists of a set of rectangles having (a) bases on a horizontal 
axis (the X axis), with centers at the class marks and lengths equal to the class interval sizes, and 
(b) areas proportional to the class frequencies. 

2. A frequency polygon is a line graph of the class frequencies plotted against class marks. It can be 
obtained by connecting the midpoints of the tops of the rectangles in the histogram. 



The histogram and frequency polygon c 
Table 2.1 are shown in Figs. 2-1 and 2-2. 



rresponding to the frequency distribution of heights in 




Height (in) 

Fig. 2-1 A MINITAB histogram showing the class midpoints and frequencies. 



Notice in Fig. 2-2 how the frequency polygon is tied down at the ends, that is, at 58 and 76. 



RELATIVE-FREQUENCY DISTRIBUTIONS 

The relative frequency of a class is the frequency of the class divided by the total frequency of all 
classes and is generally expressed as a percentage. For example, the relative frequency of the class 66-68 
in Table 2.1 is 42/100 = 42%. The sum of the relative frequencies of all classes is clearly 1, or 100%. 

If the frequencies in Table 2.1 are replaced with the corresponding relative frequencies, the resulting 
table is called a relative-frequency distribution, percentage distribution, or relative-frequency table. 

Graphic representation of relative-frequency distributions can be obtained from the histogram or 
frequency polygon simply by changing the vertical scale from frequency to relative frequency, keeping 
exactly the same diagram. The resulting graphs are called relative-frequency histograms (or percentage 
histograms) and relative-frequency polygons (or percentage polygons), respectively. 
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CUMULATIVE-FREQUENCY DISTRIBUTIONS AND OGIVES 

The total frequency of all values less than the upper class boundary of a given class interval is called 
the cumulative frequency up to and including that class interval. For example, the cumulative frequency 
up to and including the class interval 66-68 in Table 2.1 is 5+18 + 42 = 65, signifying that 65 students 
have heights less than 68.5 in. 




58 61 64 67 70 73 76 

Height (in) 

Fig. 2-2 A M1NITAB frequency polygon of the student heights. 



A table presenting such cumulative frequencies is called a cumulative-frequency distribution, 
cumulative-frequency table, or briefly a cumulative distribution and is shown in Table 2.2 for the student 
height distribution of Table 2.1. 



Table 2.2 



Height 


Number of 


(in) 


Students 


Less than 59.5 





Less than 62.5 


5 


Less than 65.5 


23 


Less than 68.5 


65 


Less than 71.5 


92 


Less than 74.5 


100 



A graph showing the cumulative frequency less than any upper class boundary plotted against the 
upper class boundary is called a cumulative-frequency polygon or ogive. For some purposes, it is desirable 
to consider a cumulative-frequency distribution of all values greater than or equal to the lower class 
boundary of each class interval. Because in this case we consider heights of 59.5 in or more, 62.5 in or 
more, etc., this is sometimes called an "or more" cumulative distribution, while the one considered above 
is a "less than" cumulative distribution. One is easily obtained from the other. The corresponding ogives 
are then called "or more" and "less than" ogives. Whenever we refer to cumulative distributions or 
ogives without qualification, the "less than" type is implied. 



RELATIVE CUMULATIVE-FREQUENCY DISTRIBUTIONS AND PERCENTAGE OGIVES 

The relative cumulative frequency, or percentage cumulative frequency, is the cumulative frequency 
divided by the total frequency. For example, the relative cumulative frequency of heights less than 68.5 in 
is 65/100 = 65%, signifying that 65% of the students have heights less than 68.5 in. If the relative 
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cumulative frequencies are used in Table 2.2, in place of cumulative frequencies, the results are called 
relative cumulative-frequency distributions (or percentage cumulative distributions) and relative cumiilaiire- 
frequency polygons (or percentage ogives), respectively. 

FREQUENCY CURVES AND SMOOTHED OGIVES 

Collected data can usually be considered as belonging to a sample drawn from a large population. 
Since so many observations are available in the population, it is theoretically possible (for continuous 
data) to choose class intervals very small and still have sizable numbers of observations falling within 
each class. Thus one would expect the frequency polygon or relative-frequency polygon for a large 
population to have so many small, broken line segments that they closely approximate curves, which 
we call frequency curves or relative-frequency curves, respectively. 

It is reasonable to expect that such theoretical curves can be approximated by smoothing the fre- 
quency polygons or relative-frequency polygons of the sample, the approximation improving as the sample 
size is increased. For this reason, a frequency curve is sometimes called a smoothed frequency polygon. 

In a similar manner, smoothed ogives are obtained by smoothing the cumulative-frequency polygons, 
or ogives. It is usually easier to smooth an ogive than a frequency polygon. 

TYPES OF FREQUENCY CURVES 

Frequency curves arising in practice take on certain characteristic shapes, as shown in Fig. 2-3. 




Skewed to the left Uniform 



Fig. 2-3 Four common distributions. 

1. Symmetrical or bell-shaped curves are characterized by the fact that observations equidistant from 
the central maximum have the same frequency. Adult male and adult female heights have bell- 
shaped distributions. 

2. Curves that have tails to the left are said to be skewed to the left. The lifetimes of males and females 
are skewed to the left. A few die early in life but most live between 60 and 80 years. Generally, 
females live about ten years, on the average, longer than males. 

3. Curves that have tails to the right are said to be skewed to the right. The ages at the time of marriage 
of brides and grooms are skewed to the right. Most marry in their twenties and thirties but a few 
marry in their forties, fifties, sixties and seventies. 

4. Curves that have approximately equal frequencies across their values are said to be uniformly 
distributed. Certain machines that dispense liquid colas do so uniformly between 15.9 and 16.1 
ounces, for example. 

5. In a J -shaped or reverse J -shaped frequency curve the maximum occurs at one end or the other. 

6. A U-shaped frequency distribution curve has maxima at both ends and a minimum in between. 

7. A bimodal frequency curve has two maxima. 

8. A multimodal frequency curve has more than two maxima. 
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Solved Problems 

ARRAYS 

2.1 (a) Arrange the numbers 17, 45, 38, 27, 6, 48, 11, 57, 34, and 22 in an array. 
(b) Determine the range of these numbers. 

SOLUTION 

(a) In ascending order of magnitude, the array is: 6, 1 1, 17, 22, 27, 34, 38, 45, 48, 57. In descending order of 
magnitude, the array is: 57, 48, 45, 38, 34, 27, 22, 17, 11, 6. 

(b) Since the smallest number is 6 and the largest number is 57, the range is 57 - 6 = 51. 

2.2 The final grades in mathematics of 80 students at State University are recorded in the accom- 
panying table. 



68 


84 


75 


82 


68 


90 


62 


88 


76 


93 


73 


79 


88 


73 


60 


93 


71 


59 


85 


75 


61 


65 


75 


87 


74 


62 


95 


78 


63 


72 


66 


78 


82 


75 


94 


77 


69 


74 


68 


60 


96 


78 


89 


61 


75 


95 


60 


79 


83 


71 


79 


62 


67 


97 


78 


85 


76 


65 


71 


75 


65 


80 


73 


57 


88 


78 


62 


76 


53 


74 


86 


67 


73 


81 


72 


63 


76 


75 


85 


77 



With reference to this table, find: 

(a) The highest grade. 

(b) The lowest grade. 

(c) The range. 

(d) The grades of the five highest-ranking students. 

(e) The grades of the five lowest-ranking students. 
(/) The grade of the student ranking tenth highest. 

(g) The number of students who received grades of 75 or higher. 

(h) The number of students who received grades below 85. 

(0 The percentage of students who received grades higher than 65 but not higher than 85. 

(J) The grades that did not appear at all. 

SOLUTION 

Some of these questions are so detailed that they are best answered by first constructing an array. This 
can be done by subdividing the data into convenient classes and placing each number taken from the table 
into the appropriate class, as in Table 2.3, called an entry table. By then arranging the numbers of each class 
into an array, as in Table 2.4, the required array is obtained. From Table 2.4 it is relatively easy to answer 
the above questions. 

(a) The highest grade is 97. 

(b) The lowest grade is 53. 

(c) The range is 97 - 53 = 44. 

(d) The five highest-ranking students have grades 97, 96, 95, 95, and 94. 
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Table 2.3 



50-54 


53 


55-59 


59, 57 


60-64 


62, 60, 61, 62, 63, 60, 61, 60, 62, 62, 63 


65-69 


68, 68, 65, 66, 69, 68, 67, 65, 65, 67 


70-74 


73, 73, 71, 74, 72, 74, 71, 71, 73, 74, 73, 72 


75-79 
80-84 


75, 76, 79, 75, 75, 78, 78, 75, 77, 78, 75, 79, 79, 78, 76, 75, 78, 76, 76, 75, 77 
84, 82, 82, 83, 80, 81 


85-89 


88, 88, 85, 87, 89, 85, 88, 86, 85 


90-94 


90, 93, 93, 94 


95-99 


95, 96, 95, 97 


Table 2.4 


50-54 


53 


55-59 


57, 59 


60-64 


60, 60, 60, 61, 61, 62, 62, 62, 62, 63, 63 


65-69 


65, 65, 65, 66, 67, 67, 68, 68, 68, 69 


70-74 


71, 71, 71, 72, 72, 73, 73, 73, 73, 74, 74, 74 


75-79 


75, 75, 75, 75, 75, 75, 75, 76, 76, 76, 76, 77, 77, 78, 78, 78, 78, 78, 79, 79, 79 


80-84 


80, 81, 82, 82, 83, 84 


85-89 


85, 85, 85, 86, 87, 88, 88, 88, 89 


90-94 


90, 93, 93, 94 


95-99 


95, 95, 96, 97 



(e) The five lowest-ranking students have grades 53, 57, 59, 60, and 60. 

(/) The grade of the student ranking tenth highest is 88. 

(g) The number of students receiving grades of 75 or higher is 44. 

(h) The number of students receiving grades below 85 is 63. 

(0 The percentage of students receiving grades higher than 65 but not higher than 85 is 49/80 = 61.2%. 

(j) The grades that did not appear are through 52, 54, 55, 56, 58, 64, 70, 91, 92, 98, 99, and 100. 



FREQUENCY DISTRIBUTIONS, HISTOGRAMS, AND FREQUENCY POLYGONS 

2.3 Table 2.5 shows a frequency distribution of the weekly wages of 65 employees at the P&R 
Company. With reference to this table, determine: 

(a) The lower limit of the sixth class. 

(b) The upper limit of the fourth class. 

(c) The class mark (or class midpoint) of the third class. 

(d) The class boundaries of the fifth class. 

(e) The size of the fifth-class interval. 
(/) The frequency of the third class. 

(g) The relative frequency of the third class. 

(h) The class interval having the largest frequency. This is sometimes called the modal class 
interval; its frequency is then called the modal class frequency. 
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Table 2.5 





Number of 


Wages 


Employees 


$250.00-$259.99 


8 


260.00-269.99 


10 


270.00-279.99 


16 


280.00-289.99 


14 


290.00-299.99 


10 


300.00-309.99 


5 


310.00-319.99 


2 




Total 65 



(i) The percentage of employees earning less than $280.00 per week. 

(J) The percentage of employees earning less than $300.00 per week but at least $260.00 
per week. 

SOLUTION 

(a) $300.00. 

(b) $289.99. 

(c) The class mark of the third class = \ ($270.00 + $279.99) = $274,995. For most practical purposes, this 
is rounded to $275.00. 

(d) Lower class boundary of fifth class = \ ($290.00 + $289.99) = $289,995. Upper class boundary of fifth 
class = \ ($299.99 + $300.00) = $299,995. 

(<?) Size of fifth-class interval = upper boundary of fifth class - lower boundary of fifth class = 

$299,995 - $289,985 = $10.00. In this case all class intervals have the same size: $10.00. 
00 16. 

( g ) 16/65 = 0.246 = 24.6%. 

(h) $270.00-$279.99. 

(z) Total number of employees earning less than $280 per week = 16 + 10 + 8 = 34. Percentage of employ- 
ees earning less than $280 per week = 34/65 = 52.3%. 

(J) Number of employees earning less than $300.00 per week but at least $260 per week = 
10 + 14 + 16 + 10 = 50. Percentage of employees earning less than $300 per week but at least $260 
per week = 50/65 = 76.9%. 

2.4 If the class marks in a frequency distribution of the weights of students are 128, 137, 146, 155, 164, 
173, and 182 pounds (lb), find (a) the class-interval size, (b) the class boundaries, and (c) the class 
limits, assuming that the weights were measured to the nearest pound. 

SOLUTION 

(a) Class-interval size = common difference between successive class marks = 137 - 128 = 146 - 137 = 
etc. = 9 lb. 

(b) Since the class intervals all have equal size, the class boundaries are midway between the class marks 
and thus have the values 



|(128 + 137), \ (137 + 146), . . . ,±(173 + 182) or 132.5, 141.5, 150.5, . . . , 177.5 lb 
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The first class boundary is 132.5 - 9 = 123.5 and the last class boundary is 177.5 + 9 = 186.5, since 
the common class-interval size is 91b. Thus all the class boundaries are given by 

123.5, 132, 141.5, 150.5, 159.5, 168.5, 177.5, 186.5 lb 

(c) Since the class limits are integers, we choose them as the integers nearest to the class boundaries, 
namely, 123, 124, 132, 133, 141, 142, .... Thus the first class has limits 124-132, the next 133-141, etc. 

2.5 The time in hours spent on cell phones, per week, by 50 college freshmen was sampled and the 
SPSS pull-down "Analyze =4> Descriptive Statistics =>■ Frequencies" resulted in the following 
output in Fig. 2-4. 







Percent 


Valid Percent 


Cumulative 


Valid 3.00 


reqUenC 3 


— ~Tu~ 


_ai — ercen^ 


~To~ 


4.00 


3 


6.0 


6.0 


12.0 


5.00 


5 


10.0 


10.0 


22.0 


6.00 


3 


6.0 


6.0 


28.0 


7.00 


4 


8.0 


8.0 


36.0 


8.00 


4 


8.0 


8.0 


44.0 


9.00 


3 


6.0 


6.0 


50.0 


10.00 


4 


8.0 


8.0 


58.0 


11.00 


2 


4.0 


4.0 


62.0 


12.00 


2 


4.0 


4.0 


66.0 


13.00 


3 


6.0 


6.0 


72.0 


14.00 


1 


2.0 


2.0 


74.0 


15.00 


2 


4.0 


4.0 


78.0 


16.00 


5 


10.0 


10.0 


88.0 


17.00 


2 


4.0 


4.0 


92.0 


18.00 


1 


2.0 


2.0 


94.0 


19.00 


2 


4.0 


4.0 


98.0 


20.00 


1 


2.0 


2.0 


100.0 


Total 


50 


100.0 


100.0 





Fig. 2-4 SPSS output for Problem 2.5. 



(a) What percent spent 15 or less hours per week on a cell phone? 

(b) What percent spent more than 10 hours per week on a cell phone? 

SOLUTION 

(a) The cumulative percent for 15 hours is 78%. Thus 78% spent 15 hours or less on a cell phone per week. 

(b) The cumulative percent for 10 hours is 58%. Thus 58% spent 10 hours or less on a cell phone per week. 
Therefore, 42% spent more than 10 hours on a cell phone per week. 

2.6 The smallest of 150 measurements is 5. 18 in, and the largest is 7.44 in. Determine a suitable set of 
(a) class intervals, (b) class boundaries, and (c) class marks that might be used in forming a 
frequency distribution of these measurements. 

SOLUTION 

The range is 7.44 - 5.18 = 2.26 in. For a minimum of five class intervals, the class-interval size is 
2.26/5 = 0.45 approximately; and for a maximum of 20 class intervals, the class-interval size is 
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2.26/20 = 0.11 approximately. Convenient choices of class-interval sizes lying between 0.11 and 0.45 would 
be 0.20, 0.30, or 0.40. 

(a) Columns I, II, and III of the accompanying table show suitable class intervals, having sizes 0.20, 0.30, 
and 0.40, respectively. 

I II III 

5.10-5.29 5.10-5.39 5.10-5.49 

5.30-5.49 5.40-5.69 5.50-5.89 

5.50-5.69 5.70-5.99 5.90-6.29 

5.70-5.89 6.00-6.29 6.30-6.69 

5.90-6.09 6.30-6.59 6.70-7.09 

6.10-6.29 6.60-6.89 7.10-7.49 

6.30-6.49 6.90-7.19 

6.50-6.69 7.20-7.49 

6.70-6.89 

6.90-7.09 

7.10-7.29 

7.30-7.49 

Note that the lower limit in each first class could have been different from 5.10; for example, if in column 
I we had started with 5.15 as the lower limit, the first class interval could have been written 5.15-5.34. 

(b) The class boundaries corresponding to columns I, II, and III of part (a) are given, respectively, by 

I 5.095-5.295, 5.295-5.495, 5.495-5.695, . . . , 7.295-7.495 

II 5.095-5.395, 5.395-5.695, 5.695-5.995, . . . , 7.195-7.495 

III 5.095-5.495, 5.495-5.895, 5.895-6.295, . . . , 7.095-7.495 

Note that these class boundaries are suitable since they cannot coincide with the observed measure- 
ments. 

(c) The class marks corresponding to columns I, II, and III of part (a) are given, respectively, by 

I 5.195, 5.395,..., 7.395 II 5.245, 5.545, 7.345 III 5.295, 5.695, 7.295 
These class marks have the disadvantage of not coinciding with the observed measurements. 



2.7 In answering Problem 2.6(a), a student chose the class intervals 5.10-5.40, 5.40-5.70,..., 
6.90-7.20, and 7.20-7.50. Was there anything wrong with this choice? 



SOLUTION 

These class intervals overlap at 5.40, 5.70, . . . , 7.20. Thus a measurement recorded as 5.40, for example, 
could be placed in either of the first two class intervals. Some statisticians justify this choice by agreeing to 
place half of such ambiguous cases in one class and half in the other. 

The ambiguity is removed by writing the class intervals as 5.10 to under 5.40, 5.40 to under 5.70, etc. 
In this case, the class limits coincide with the class boundaries, and the class marks can coincide with 
the observed data. 

In general it is desirable to avoid overlapping class intervals whenever possible and to choose them 
so that the class boundaries are values not coinciding with actual observed data. For example, the 
class intervals for Problem 2.6 could have been chosen as 5.095-5.395, 5.395-5.695, etc., without 
ambiguity. A disadvantage of this particular choice is that the class marks do not coincide with the 
observed data. 
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2.8 In the following table the weights of 40 male students at State University are recorded to the 
nearest pound. Construct a frequency distribution. 



138 


164 


150 


132 


144 


125 


149 


157 


146 


158 


140 


147 


136 


148 


152 


144 


168 


126 


138 


176 


163 


119 


154 


165 


146 


173 


142 


147 


135 


153 


140 


135 


161 


145 


135 


142 


150 


156 


145 


128 



SOLUTION 

The largest weight is 1761b and the smallest weight is 1191b, so that the range is 176 - 1 19 = 57 lb. 
If five class intervals are used, the class-interval size is 57/5 = 11 approximately; if 20 class intervals are 
used, the class-interval size is 57/20 = 3 approximately. 

One convenient choice for the class-interval size is 51b. Also, it is convenient to choose the class 
marks as 120, 125, 130, 135,... lb. Thus the class intervals can be taken as 118-122, 123-127, 
128-132,.... With this choice the class boundaries are 117.5, 122.5, 127.5,..., which do not coincide 
with the observed data. 

The required frequency distribution is shown in Table 2.6. The center column, called a tally, or score 
sheet, is used to tabulate the class frequencies from the raw data and is usually omitted in the final pre- 
sentation of the frequency distribution. It is unnecessary to make an array, although if it is available it can be 
used in tabulating the frequencies. 

Another method 

Of course, other possible frequency distributions exist. Table 2.7, for example, shows a frequency 
distribution with only seven classes in which the class interval is 91b. 



Weight (lb) 


Tally 


Frequency 


118-122 


/ 


1 


123-127 


// 




128-132 


// 


2 


133-137 


//// 


4 


138-142 


mi 


6 


143-147 


Will 


8 


148-152 
153-157 


m 
mi 


5 
4 


158-162 


ii 


2 


163-167 


in 


3 


168-172 


i 


1 


173-177 


ii 


2 




Total 40 



Weight (lb) 


Tally 


Frequency 


118-126 


/// 


3 


127-135 


m 


5 


136-144 


W llll 


9 


145-153 


w m n 


12 


154-162 


mi 


5 


163-171 


mi 


4 


172-180 


n 


2 




Total 40 



2.9 In the following, the heights of 45 female students at Midwestern University are recorded to the 
nearest inch. Use the statistical package, STATISTIX, to construct a histogram. 

67 67 64 64 74 61 68 71 69 61 65 64 62 63 59 
70 66 66 63 59 64 67 70 65 66 66 56 65 67 69 
64 67 68 67 67 65 74 64 62 68 65 65 65 66 67 
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SOLUTION 

After entering the data into the STATISTIX worksheet, the STATISTIX pull-down "Statistics => 
Summary Statistics => Histogram" produces the histogram of female heights in Fig. 2-5. 



27 



55 59 63 67 71 75 

Height (in) 

Fig. 2-5 Histogram of 45 female heights produced by STATISTIX. 



2.10 Table 2.8 gives the one-way commuting distances for 50 Metropolitan College students in miles. 



Table 2.8 Commuting distances from Metropolitan College (miles) 



The SPSS histogram for the commuting distances in Table 2.8 is shown in Fig. 2-6. Note that the 
classes are to 2, 2 to 4, 4 to 6, 6 to 8, 8 to 10, 10 to 12, 12 to 14, and 14 to 16. The frequencies are 
7, 13, 11, 7, 6, 3, 2 and 1. A number is counted in the class if it falls on the lower class limit but 
in the next class if it falls on the upper class limit. 



(</) 


What data values 




in 


the first class? 


(/') 


What data values 




in 


the second class? 


(c) 


What data values 






the third class? 


(d) 


What data values 




in 


the fourth class? 


(e) 


What data values 


are 


in 


the fifth class? 


GO 


What data values are 


n 


the sixth class? 


(M) 


What data values 


are 


in 


the seventh class? 


(//) 


What data values 


are 


in 


the eighth class? 
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Mean = 5.368 
Std. Dev. = 3.36515 
N = 50 



0.00 5.00 10.00 15.00 20.00 
Distance (miles) 



Fig. 2-6 SPSS histogram for 



g distances from Metropolitan College. 



SOLUTION 

(a) 0.9, 0.9, 1.0, 1.1, 1.4, 1.6, 1.9 

(b) 2.0, 2.0, 2.2, 2.4, 2.6, 3.0, 3.2, 3.3, 3.7, 3.8, 3.8, 3.9, 3.9 

(c) 4.0, 4.2, 4.3, 4.3, 4.4, 4.4, 4.6, 4.8, 4.8, 4.9, 5.0 

(d) 6.2, 6.5, 6.6, 7.0, 7.2, 7.7, 7.8 

(e) 8.0, 8.0, 8.0, 8.4, 8.7, 8.8 

(f) 10.0, 10.3, 10.3 
fe) 12.3, 12.6 

(h) 15.7 

2.11 The SAS histogram for the commuting distances in Table 2.8 is shown in Fig. 2-7. The midpoints 
of the class intervals are shown. The classes are to 2.5, 2.5 to 5.0, 5.0 to 7.5, 7.5 to 10.0, 10.0 to 
12.5, 12.5 to 15.0, 15.0 to 17.5, and 17.5 to 20.0. A number is counted in the class if it falls on the 
lower class limit but in the next class if it falls on the upper class limit. 

(a) What data values (in Table 2.8) are in the first class? 

(b) What data values are in the second class? 

(c) What data values are in the third class? 

(d) What data values are in the fourth class? 

(e) What data values are in the fifth class? 
if) What data values are in the sixth class? 

(g) What data values are in the seventh class? 

SOLUTION 

(a) 0.9, 0.9, 1.0, 1.1, 1.4, 1.6, 1.9, 2.0, 2.0, 2.2, 2.4 

(b) 2.6, 3.0, 3.2, 3.3, 3.7, 3.8, 3.8, 3.9, 3.9, 4.0, 4.2, 4.3, 4.3, 4.4, 4.4, 4.6, 4.8, 4.8, 4.9 

(c) 5.0, 6.2, 6.5, 6.6, 7.0, 7.2 

(d) 7.7, 7.8, 8.0, 8.0, 8.0, 8.4, 8.7, 8.8 
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(e) 10.0, 10.3, 10.3, 12.3 
03 12.6 
(g) 15.7 

20.0 - — 

17.5 - 

15.0 - 

12.5 - 

o 10.0- 
o 

7.5 - 



5.0 
2.5 



1.25 3.75 6.25 8.75 11.25 13.75 16.25 
Distance (miles) 

Fig. 2-7 SAS histogram for commuting distances from Metropolitan College. 



2.12 At the P&R Company (Problem 2.3), five new employees were hired at weekly wages of $285.34, 
$316.83, $335.78, $356.21, and $374.50. Construct a frequency distribution of wages for the 
70 employees. 



SOLUTION 

Possible frequency distributions are shown in Table 2.9. 



Wages 


Frequency 


$250.00-5259.99 




260.00-269.99 


10 


270.00-279.99 


16 


280.00-289.99 


15 


290.00-299.99 


10 


300.00-309.99 


5 


310.00-319.99 


3 


320.00-329.99 





330.00-339.99 


1 


340.00-349.99 





350.00-359.99 


1 


360.00-369.99 





370.00-379.99 


1 




Total 70 



Wages 


Frequency 


S250.00-S259.99 


8 


260.00-269.99 


10 


270.00-279.99 


16 


280.00-289.99 


15 


290.00-299.99 


10 


300.00-309.99 


5 


310.00-319.99 


3 


320.00 and over 


3 




Total 70 



CHAP. 2] 



FREQUENCY DISTRIBUTIONS 



51 



Table 2.9(c) 



Wages 


Frequency 


S250.00-S269.99 
270.00-289.99 
290.00-309.99 
310.00-329.99 
330.00-349.99 
350.00-369.99 
370.00-389.99 


18 
31 
15 

3 
1 
1 
1 



Total 70 



Table 2.9(d) 



Wages 


Frequency 


S250.00-S259.99 


8 


260.00-269.99 


10 


270.00-279.99 


16 


280.00-289.99 


15 


290.00-299.99 


10 


300.00-319.99 


8 


320.00-379.99 


3 



Total 70 



In Table 2.9(a), the same class-interval size, $10.00, has been maintained throughout the table. As a 
result, there are too many empty classes and the detail is much too fine at the upper end of the wage scale. 

In Table 2.9(b), empty classes and fine detail have been avoided by use of the open class interval 
"$320.00 and over." A disadvantage of this is that the table becomes useless in performing certain 
mathematical calculations. For example, it is impossible to determine the total amount of wages paid 
per week, since "over $320.00" might conceivably imply that individuals could earn as high as $1400.00 
per week. 

In Table 2.9(c), a class-interval size of $20.00 has been used. A disadvantage is that much information is 
lost at the lower end of the wage scale and the detail is still fine at the upper end of the scale. 

In Table 2.9(d), unequal class-interval sizes have been used. A disadvantage is that certain mathematical 
calculations to be made later lose a simplicity that is available when class intervals have the same size. 
Also, the larger the class-interval size, the greater will be the grouping error. 



2.13 The EXCEL histogram for the commuting distances in Table 2.8 is shown in Fig. 2-8. The classes 
are to 3, 3 to 6, 6 to 9, 9 to 12, 12 to 15, and 15 to 18. A number is counted in the class if it falls 
on the upper class limit but in the prior class if it falls on the lower class limit. 

(a) What data values (in Table 2.8) are in the first class? 

(b) What data values are in the second class? 

(c) What data values are in the third class? 




Fig. 2-8 EXCEL histogram 



Distance (miles) 

for commuting distances from Metropolitan College. 
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(d) What data values are in the fourth class? 

(e) What data values are in the fifth class? 

(f) What data values are in the sixth class? 



SOLUTION 














(a) 0.9,0.9,1.0,1.1 


1.4 


1.6, 


1.9, 2.0, 2.0, 2.2, 


2.4, 


2.6, 


3.0 


(b) 3.2, 3.3, 3.7, 3.8 


3.8 


3.9, 


3.9, 4.0, 4.2, 4.3, 


4.3, 


4.4. 


4.4. 


(c) 6.2, 6.5, 6.6, 7.0, 


7.2, 


7.7, 


7.8, 8.0, 8.0, 8.0, 


8.4. 


8.7, 


8.8 


(d) 10.0, 10.3, 10.3 














(e) 12.3, 12.6 














00 15.7 















CUMULATIVE-FREQUENCY DISTRIBUTIONS AND OGIVES 

2.14 Construct (a) a cumulative-frequency distribution, (b) a percentage cumulative distribution, 
(c) an ogive, and (d) a percentage ogive from the frequency distribution in Table 2.5 of 
Problem 2.3. 



Table 2.10 







Percentage 




Cumulative 


Cumulative 


Wages 


Frequency 


Distribution 


Less than $250.00 





0.0 


Less than $260.00 


8 


12.3 


Less than $270.00 


18 


27.7 


Less than $280.00 


34 


52.3 


Less than $290.00 


48 


73.8 


Less than $300.00 


58 


89.2 


Less than $310.00 


63 


96.9 


Less than $320.00 


65 


100.0 



SOLUTION 

(a) and (b) The cumulative-frequency distribution and percentage cumulative distribution (or cumu- 
lative relative-frequency distribution) are shown in Table 2.10. 

Note that each entry in column 2 is obtained by adding successive entries from column 2 of Table 2.5, 
thus, 18 = 8 + 10, 34 = 8 + 10 + 16, etc. 

Each entry in column 3 is obtained from the previous column by dividing by 65, the total frequency, and 
expressing the result as a percentage. Thus, 34/65 = 52.3%. Entries in this column can also be obtained by 
adding successive entries from column 2 of Table 2.8. Thus, 27.7 = 12.3 + 15.4, 52.3 = 12.3 + 15.4 + 24.6, 

(c) and (d) The ogive (or cumulative-frequency polygon) is shown in Fig. 2-9(a) and the percentage 
ogive (relative cumulative-frequency polygon) is shown in Fig. 2-9(b). Both of these are Minitab generated 
plots. 
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(£>) Wages 
250 260 270 280 290 300 310 320 



70 
60 
50 
40 
30 
20 
10 


250 260 270 280 290 300 310 320 
(a) Wages 

Fig. 2-9 (a) MINITAB cumulative frequency and (b) percentage cumulative frequency graphs. 

2.15 From the frequency distribution in Table 2.5 of Problem 2.3, construct (a) an "or more" 
cumulative-frequency distribution and (b) an "or more" ogive. 

SOLUTION 

(a) Note that each entry in column 2 of Table 2.11 is obtained by adding successive entries from column 2 
of Table 2.5, starting at the bottom of Table 2.5; thus 7 = 2 + 5, 17 = 2 + 5 + 10, etc. These entries can 
also be obtained by subtracting each entry in column 2 of Table 2.10 from the total frequency, 65; 
thus 57 = 65 - 8, 47 = 65 - 18, etc. 

(b) Figure 2-10 shows the "or more" ogive. 



Table 2.11 



Wages 


"Or More" 
Cumulative Frequency 


$250.00 or more 


65 


260.00 or more 


57 


270.00 or more 


47 


280.00 or more 


31 


290.00 or more 


17 


300.00 or more 


7 


310.00 or more 


2 


320.00 or more 







2.16 From the ogives in Figs. 2-9 and 2-10 (of Problems 2.14 and 2.15, respectively), estimate the 
number of employees earning (a) less than $288.00 per week, (b) $296.00 or more per week, and 
(c) at least $263.00 per week but less than $275.00 per week. 

SOLUTION 

(a) Referring to the "less than" ogive of Fig. 2-9, construct a vertical line intersecting the "Wages" axis at 
$288.00. This line meets the ogive at the point with coordinates (288, 45); hence 45 employees earn less 
than $288.00 per week. 

(b) In the "or more" ogive of Fig. 2-10, construct a vertical line at $296.00. This line meets the ogive at the 
point (296, 11); hence 11 employees earn $296.00 or more per week. 
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70 




250 260 270 280 290 300 310 320 330 340 350 
Wages 

Fig. 2-10 EXCEL produced "Or more" cumulative frequency graph. 

This can also be obtained from the "less than" ogive of Fig. 2.9. By constructing a line at $296.00, 
we find that 54 employees earn less than $296.00 per week; hence 65 — 54 = 11 employees earn $296.00 
or more per week. 

(c) Using the "less than" ogive of Fig. 2-9, we have: Required number of employees = number earning less 
than $275.00 per week - number earning less than $263.00 per week = 26 — 1 1 = 15. 
Note that the above results could just as well have been obtained by the process of interpolation in the 
cumulative-frequency tables. In part (a), for example, since $288.00 is 8/10, or 4/5, of the way between 
$280.00 and $290.00, the required number of employees should be 4/5 of the way between the corresponding 
values 34 and 48 (see Table 2.10). But 4/5 of the way between 34 and 48 is | (48 — 34) = 11. Thus the 
required number of employees is 34 + 1 1 = 45. 

2.17 Five pennies were tossed 1000 times, and at each toss the number of heads was observed. The 
number of tosses during which 0, 1, 2, 3, 4, and 5 heads were obtained is shown in Table 2.12. 

(a) Graph the data of Table 2.12. 

(b) Construct a table showing the percentage of tosses resulting in a number of heads less than 0, 
1, 2, 3, 4, 5, or 6. 

(c) Graph the data of the table in part (b). 



Table 2.12 



Number of Heads 


Number of Tosses 
(frequency) 





38 


1 


144 


2 


342 


3 


287 


4 


164 


5 


25 




Total 1000 
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(a) The data can be shown graphically either as in Fig. 2-1 
Figure 2-11 seems to be a more natural graph to u 
cannot be 1.5 or 3.2. The graph is called a dotplot. It is 



1 or as in Fig. 2-12. 

se — since, for example, the number of heads 
especially used when the data are discrete. 



Heads 

Each symbol represents up to 9 observations. 

Fig. 2-11 MINITAB dotplot of the number of heads. 



| 201 
J 151 



Heads 

Fig. 2-12 MINITAB histogram of the number of heads. 



Figure 2-12 shows a histogram of the data. Note that the total area of the histogram is the total 
frequency 1000, as it should be. In using the histogram representation or the corresponding 
frequency polygon, we are essentially treating the data as {/'they were continuous. This will later be 
found useful. Note that we have already used the histogram and frequency polygon for discrete data 
in Problem 2.10. 

(b) Referring to the required Table 2.13, note that it shows simply a cumulative-frequency distribution and 
percentage cumulative distribution of the number of heads. It should be observed that the entries "Less 
than 1," "Less than 2," etc., could just as well have been "Less than or equal to 0," "Less than or equal 

(c) The required graph can be presented either as in Fig. 2-13 or as in Fig. 2-14. 

Figure 2-13 is the most natural for presenting discrete data — since, for example, the percentage of 
tosses in which there will be less than 2 heads is equal to the percentage in which there will be less than 
1.75, 1.56, or 1.23 heads, so that the same percentage (18.2%) should be shown for these values 
(indicated by the horizontal line). 
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Table 2.13 



Number of Heads 


Number of Tosses 
(cumulative frequency) 


Percentage 
Number of Tosses 
(percentage 
cumulative frequency) 


Less than 






Less than 1 


38 


3.8 


Less than 2 


182 


18.2 


Less than 3 


524 


52.4 


Less than 4 


811 


81.1 


Less than 5 


975 


97.5 


Less than 6 


1000 


100.0 



Figure 2-14 shows the cumulative-frequency polygon, or ogive, for the data and essentially treats 
the data as if they were continuous. 

Note that Figs. 2-13 and 2-14 correspond, respectively, to Figs. 2-11 and 2-12 of part (a). 




1 2 3 4 5 6 7 

Heads 

Fig. 2-13 MINITAB step function. 




1 2 3 4 5 6 

Heads 

Fig. 2-14 MINITAB cumulative frequency polygon. 
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FREQUENCY CURVES AND SMOOTHED OGIVES 

2.18 Samples from populations have histograms and frequency polygons that have certain shapes. If 
the samples are extremely large, the histograms and frequency polygons approach the frequency 
distribution of the population. Consider two such population frequency distributions, (a) Con- 
sider a machine that fills cola containers uniformly between 15.9 and 16.1 ounces. Construct the 
frequency curve and determine what percent put more than 15.95 ounces in the containers. 
(b) Consider the heights of females. They have a population frequency distribution that is 
symmetrical or bell shaped with an average equal to 65 in and standard deviation equal 3 in. 
(Standard deviation is discussed in a later chapter.) What percent of the heights are between 62 
and 68 in tall, that is are within one standard deviation of the mean? What percent are within two 
standard deviations of the mean? What percent are within three standard deviations of the mean? 

SOLUTION 

Figure 2-15 shows a uniform frequency curve. The cross-hatched region corresponds to fills of more 
than 15.95 ounces. Note that the region covered by the frequency curve is shaped as a rectangle. The area 
under the frequency curve is length x width or (16.10 - 15.90) x 5 = 1. The area of the cross hatched region 
is (16.10 - 15.90) x 5 = 0.75. This is interpreted to mean that 75% of the machine fills exceed 15.95 ounces. 

Figure 2-16 shows a symmetrical or bell-shaped frequency curve. It shows the heights within one 
standard deviation as the cross-hatched region. Calculus is used to find this area. The area within one 




Fig. 2-15 MINITAB uniform frequency 




56 59 62 65 68 71 74 
Height (inches) 

Fig. 2-16 MINITAB bell shaped frequency curve showing heights between 62 and 68 in and their 
frequencies. 
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standard deviation is approximately 68% of the total area under the curve. The area within two standard 
deviations is approximately 95% of the total area under the curve. The area within three standard deviations 
is approximately 99.7% of the total area under the curve. 

We shall have more to say about finding areas under these curves in later chapters. 



Supplementary Problems 

2.19 (a) Arrange the numbers 12, 56, 42, 21, 5, 18, 10, 3, 61, 34, 65, and 24 in an array and (b) determine the 
range. 

2.20 Table 2.14 shows the frequency distribution for the number of minutes per week spent watching TV by 400 
junior high students. With reference to this table determine: 

(a) The upper limit of the fifth class 

(b) The lower limit of the eighth class 

(c) The class mark of the seventh class 

(d) The class boundaries of the last class 
(<?) The class-interval size 

(/) The frequency of the fourth class 

(g) The relative frequency of the sixth class 

(h) The percentage of students whose weekly viewing time does not exceed 600 minutes 

(i) The percentage of students with viewing times greater than or equal to 900 minutes 

(J) The percentage of students whose viewing times are at least 500 minutes but less than 1000 minutes 



Table 2.14 



Viewing Time 


Number of 


(minutes) 


Students 


300-399 


14 


400^199 


46 


500-599 


58 


600-699 


76 


700-799 


68 


800-899 


62 


900-999 


48 


1000-1099 


22 


1100-1199 


6 



2.21 Construct (a) a histogram and (b) a frequency polygon corresponding to the frequency distribution of 
Table 2.14. 



2.22 For the data in Table 2.14 of Problem 2.20, construct (a) a relative-frequency distribution, (b) a relative- 
frequency histogram, and (c) a relative-frequency polygon. 
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2.23 For the data in Table 2.14, construct (a) a cumulative-frequenc) distribution, (b) a percentage cumulative 
distribution, (c) an ogive, and (d) a percentage ogive. (Note that unless otherwise specified, a cumulative 
distribution refers to one made on a "less than" basis.) 

2.24 Work Problem 2.23 for the case where the frequencies are cumulative on an "or more" basis. 

2.25 Using the data in Table 2.14, estimate the percentage of students that have viewing times of (a) less than 560 
minutes per week, (b) 970 or more minutes per week, and (c) between 620 and 890 minutes per week. 

2.26 The inner diameters of washers produced by a company can be measured to the nearest thousandth of an 
inch. If the class marks of a frequency distribution of these diameters are given in inches by 0.321, 0.324, 
0.327, 0.330, 0.333, and 0.336, find (a) the class-interval size, (b) the class boundaries, and (c) the class limits. 

2.27 The following table shows the diameters in centimeters of a sample of 60 ball bearings manufactured by a 
company. Construct a frequency distribution of the diameters, using appropriate class intervals. 

1.738 1.729 1.743 1.740 1.736 1.741 1.735 1.731 1.726 1.737 

1.728 1.737 1.736 1.735 1.724 1.733 1.742 1.736 1.739 1.735 

1.745 1.736 1.742 1.740 1.728 1.738 1.725 1.733 1.734 1.732 

1.733 1.730 1.732 1.730 1.739 1.734 1.738 1.739 1.727 1.735 

1.735 1.732 1.735 1.727 1.734 1.732 1.736 1.741 1.736 1.744 

1.732 1.737 1.731 1.746 1.735 1.735 1.729 1.734 1.730 1.740 

2.28 For the data of Problem 2.27 construct (a) a histogram, (b) a frequency polygon, (c) a relative-frequency 
distribution, (d) a relative-frequency histogram, (e) a relative-frequency polygon (f ) a cumulative-frequency 
distribution, (g) a percentage cumulative distribution, (h) an ogive, and (/) a percentage ogive. 

2.29 From the results in Problem 2.28, determine the percentage of ball bearings having diameters (a) exceeding 
1.732 cm, (b) not more than 1.736 cm, and (c) between 1.730 and 1.738 cm. Compare your results with those 
obtained directly from the raw data of Problem 2.27. 

2.30 Work Problem 2.28 for the data of Problem 2.20. 

2.31 According to the U.S. Bureau of the Census, Current Population Reports, the 1996 population of the 
United States is 265,284,000. Table 2.15 gives the percent distribution for various age groups. 

What is the width, or size, of the second class interval? The fourth class interval? 
b) How many different class-interval sizes are there? 

How many open class intervals are there? 
[d) How should the last class interval be written so that its class width will equal that of the next to last 
class interval? 

e) What is the class mark of the second class interval? The fourth class interval? 
/) What are the class boundaries of the fourth class interval? 

(g) What percentage of the population is 35 years of age or older? What percentage of the population is 64 
or younger? 

li) What percentage of the ages are between 20 and 49 inclusive? 
What percentage of the ages are over 70 years? 

a) Why is it impossible to construct a percentage histogram or frequency polygon for the distribution in 
Table 2.15? 

b) How would you modify the distribution so that a percentage histogram and frequency polygon could 
be constructed? 

) Perform the construction using the modification in part (b). 
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A e rou in ears 
ge group in years 


% of U.S. 


Under 5 


7.3 


5-9 


7.3 


10-14 


7.2 


15-19 


7.0 


20-24 


6.6 


25-29 


7.2 


30-34 


8.1 


35-39 


8.5 


40^14 


7.8 


45^19 


6.9 


50-54 


5.3 


55-59 


4.3 


60-64 


3.8 


65-74 


7.0 


75-84 


4.3 


85 and over 


1.4 



2.33 Refer to Table 2.15. Assume that the total population is 265 million and that the class "Under 5" contains 
babies who are not yet 1 year old. Determine the number of individuals in millions to one decimal point in 
each age group. 

[a) Construct a smoothed percentage frequency polygon and smoothed percentage ogive corresponding to 
the data in Table 2.14. 

[b) From the results in part (a), estimate the probability that a student views less than 10 hours of TV 
per week. 

) From the results in part (a), estimate the probability that a student views TV for 15 or more hours 
per week. 

d) From the results in part (a), estimate the probability that a student views TV for less than 5 hours 
per week. 

a) Toss four coins 50 times and tabulate the number of heads at each toss. 

\b) Construct a frequency distribution showing the number of tosses in which 0, 1,2, 3, and 4 heads 
appeared. 

'c) Construct a percentage distribution corresponding to part (b). 

d) Compare the percentage obtained in part (c) with the theoretical ones 6.25%, 25%, 37.5%, 25%, and 

6.25% (proportional to 1, 4, 6, 4, and 1) arrived at by rules of probability. 
') Graph the distributions in parts (b) and (c). 
'/) Construct a percentage ogive for the data. 



2.36 Work Problem 2.35 with 50 more tosses of the four coins and see if the experiment is more in agreement with 
theoretical expectation. If not, give possible reasons for the differences. 




CHAPTER 3 




The Mean, Median, 
Mode, and Other 
Measures of 
Central Tendency 



INDEX, OR SUBSCRIPT, NOTATION 

Let the symbol X f (read "X sub/') denote any of the N values X u X 2 , X 3 ,...,X N assumed by a 
variable X. The letter j in Xj, which can stand for any of the numbers 1, 2, 3, . . . , N is called a subscript, 
or index. Clearly any letter other than j, such as i, k, p, q, or s, could have been used as well. 

SUMMATION NOTATION 

The symbol J2jL\ Xj ' s use d to denote the sum of all the X/s from j = 1 to j = N; by definition, 

TV 

Xj = x x + x 2 + x 3 + ■ ■ ■ + X N 

When no confusion can result, we often denote this sum simply by J2 X, Xj, or £\. Xj. The symbol Yl, 
is the Greek capital letter sigma, denoting sum. 

EXAMPLE 1. ^ XjYj = XiYi + X 2 Y 2 + X 3 Y 3 + ■ ■ ■ + X N Y N 

EXAMPLE 2. aX i = aX l +aX 2 + --- + aX N = a(X, + X 2 + ■ • • + X N ) = Xj 
where a is a constant. More simply, Yl Q X = a X. 

EXAMPLE 3. If a, b, and c are any constants, then £ {aX+bY- cZ) = aJ2 X + bJ2 Y - cJ2 Z. See Problem 3.3. 
61 
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AVERAGES, OR MEASURES OF CENTRAL TENDENCY 

An average is a value that is typical, or representative, of a set of data. Since such typical values tend 
to lie centrally within a set of data arranged according to magnitude, averages are also called measures of 
central tendency. 

Several types of averages can be defined, the most common being the arithmetic mean, the median, 
the mode, the geometric mean, and the harmonic mean. Each has advantages and disadvantages, depend- 
ing on the data and the intended purpose. 



THE ARITHMETIC MEAN 

The arithmetic mean, or briefly the mean, of a set of N numbers X\, X 2 , Z 3 , . . . , X N is denoted by X 
(read "X bar") and is defined as 

N 

-_ x l +x 2 + x 3 + --- + x N _ j= l In 



EXAMPLE 4. The arithmetic mean of the numbers 8, 3, 5, 12, and 10 is 

_ 8 + 3 + 5+ 12+ 10 38 nr 

x = 5 = T = 7 - 6 

If the numbers X u X 2 , ■ ■ ■ , X K occur /!, f 2 , . . . , f K times, respectively (i.e., occur with frequencies 
f, f 2 ,..., f£), the arithmetic mean is 

K 

E fj x J 

y^ f\x x +f 2 x 2 + --- + f K x K ^ U " ' JLf x = T.fx ,~ 
/1 + /2 + ■■■ + f K * £/ n u 

where N = J2f is the total frequency (i.e., the total number of cases). 

EXAMPLE 5. If 5, 8, 6, and 2 occur with frequencies 3, 2, 4, and 1, respectively, the arithmetic mean is 
= (3)(5) + (2)(8) + (4)(6) + (1)(2) ^ 15 + 16 + 24 + 2 
3 + 2 + 4+1 10 



THE WEIGHTED ARITHMETIC MEAN 

Sometimes we associate with the numbers X U X 2 ,...,X K certain weighting factors (or weights) 
w x ,w 2 ,..., w K , depending on the significance or importance attached to the numbers. In this case, 

W! X { + w 2 X 2 + ■■■ + w K X k _ £ wX 
X - w,+w 2 + ... + w, (J) 

is called the weighted arithmetic mean. Note the similarity to equation (2), which can be considered a 
weighted arithmetic mean with weights f, f 2 , . . . , f K . 

EXAMPLE 6. If a final examination in a course is weighted 3 times as much as a quiz and a student has a final 
examination grade of 85 and quiz grades of 70 and 90, the mean grade is 

^ (l)(70) + (l)(90) + (3)(85) ^415 = g3 
1+1+3 5 
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PROPERTIES OF THE ARITHMETIC MEAN 

1. The algebraic sum of the deviations of a set of numbers from their arithmetic mean is zero. 

EXAMPLE 7. The deviations of the numbers 8, 3, 5, 12, and 10 from their arithmetic mean 7.6 are 8 - 7.6, 3 - 7.6, 
5 - 7.6, 12 - 7.6, and 10 - 7.6, or 0.4, -4.6, -2.6, 4.4, and 2.4, with algebraic sum 0.4 - 4.6 - 2.6 + 4.4 + 2.4 = 0. 

2. The sum of the squares of the deviations of a set of numbers 2) from any number a is a minimum if 
and only if a — X (see Problem 4.27). 

3. If /i numbers have mean m u f 2 numbers have mean m 2 , . . . , f K numbers have mean m K , then the 
mean of all the numbers is 

y _ f\ m \ +fl m 2 + ' ' ' +.fK m K / v\ 

/1+/2+ ■■■+/* [> 

that is, a weighted arithmetic mean of all the means (see Problem 3.12). 

4. If A is any guessed or assumed arithmetic mean (which may be any number) and if dj = Xj — A are the 
deviations of Xj from A, then equations (1) and (2) become, respectively, 



£4 



Em- 

where TV = Yf=\ fj = J2f- Note that formulas (5) and (6) are summarized in the equation 
X = A + d (see Problem 3.18). 



THE ARITHMETIC MEAN COMPUTED FROM GROUPED DATA 

When data are presented in a frequency distribution, all values falling within a given class interval 
are considerd to be coincident with the class mark, or midpoint, of the interval. Formulas (2) and (6) are 
valid for such grouped data if we interpret Xj as the class mark,/} as its corresponding class frequency, A 
as any guessed or assumed class mark, and dj = Xj — A as the deviations of 2} from A. 

Computations using formulas (2) and (<5) are sometimes called the long and short methods, respec- 
tively (see Problems 3.15 and 3.20). 

If class intervals all have equal size c, the deviations dj = Xj — A can all be expressed as cuj, where Uj 
can be positive or negative integers or zero (i.e., 0, ±1, ±2, ±3, . . .), and formula (6) becomes 

which is equivalent to the equation X = A + cu (see Problem 3.21). This is called the coding method for 
computing the mean. It is a very short method and should always be used for grouped data where the 
class-interval sizes are equal (see Problems 3.22 and 3.23). Note that in the coding method the values of 
the variable X are transformed into the values of the variable u according to X = A + cu. 
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THE MEDIAN 

The median of a set of numbers arranged in order of magnitude (i.e., in an array) is either the middle 
value or the arithmetic mean of the two middle values. 

EXAMPLE 8. The set of numbers 3, 4, 4, 5, 6, 8, 8, 8, and 10 has median 6. 

EXAMPLE 9. The set of numbers 5, 5, 7, 9, 11, 12, 15, and 18 has median 1(9+ 11) = 10. 

For grouped data, the median, obtained by interpolation, is given by 



where L l = lower class boundary of the median class (i.e., the class containing the median) 
N = number of items in the data (i.e., total frequency) 
(Hf)i — sum °f frequencies of all classes lower than the median class 
./median = frequency of the median class 
c = size of the median class interval 

Geometrically the median is the value of X (abscissa) corresponding to the vertical line which divides 
a histogram into two parts having equal areas. This value of X is sometimes denoted by X. 

THE MODE 

The mode of a set of numbers is that value which occurs with the greatest frequency; that is, it is the 
most common value. The mode may not exist, and even if it does exist it may not be unique. 

EXAMPLE 10. The set 2, 2, 5, 7, 9, 9, 9, 10, 10, 11, 12, and 18 has mode 9. 

EXAMPLE 11. The set 3, 5, 8, 10, 12, 15, and 16 has no mode. 

EXAMPLE 12. The set 2, 3, 4, 4, 4, 5, 5, 7, 7, 7, and 9 has two modes, 4 and 7, and is called bimodal. 

A distribution having only one mode is called unimodal. 

In the case of grouped data where a frequency curve has been constructed to fit the data, the mode 
will be the value (or values) of X corresponding to the maximum point (or points) on the curve. 
This value of X is sometimes denoted by X. 

From a frequency distribution or histogram the mode can be obtained from the formula 



where L x = lower class boundary of the modal class (i.e., the class containing the mode) 
A, — excess of modal frequency over frequency of next-lower class 
A 2 = excess of modal frequency over frequency of next-higher class 
c = size of the modal class interval 



THE EMPIRICAL RELATION BETWEEN THE MEAN, MEDIAN, AND MODE 

For unimodal frequency curves that are moderately skewed (asymmetrical), we have the empirical 
relation 





(9) 



Mean - mode = 3 (mean - median) 



(10) 
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Figures 3-1 and 3-2 show the relative positions of the mean, median, and mode for frequency curves 
skewed to the right and left, respectively. For symmetrical curves, the mean, mode, and median all 
coincide. 



o 




Mode Median Mean 



Fig. 3-1 Relative positions of mode, median, and mean for a right-skewed frequency curve. 








Mean Median Mode 



Fig. 3-2 Relative positions of mode, median, and mean for a left-skewed frequency curve. 



THE GEOMETRIC MEAN G 

The geometric mean G of a set of N positive numbers X u X 2 , X 3 , . . . , X N is the Nth root of the 
product of the numbers: 

G= ^X X X 2 X 3 ---X N (77) 

EXAMPLE 13. The geometric mean of the numbers 2, 4, and 8 is G =^/(2)(4)(8) =^64 = 4. 

We can compute G by logarithms (see Problem 3.35) or by using a calculator. For the geometric 
mean from grouped data, see Problems 3.36 and 3.91. 

THE HARMONIC MEAN H 

The harmonic mean H of a set of N numbers X u X 2 , X 3 , . . . , X N is the reciprocal of the arithmetic 
mean of the reciprocals of the numbers: 
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In practice it may be easier to remember that 

1 ^ Y 



N -N^X ™ 



EXAMPLE 14. The harmonic mean of the numbers 2, 4, and 8 is 
3 3 



For the harmonic mean from grouped data, see Problems 3.99 and 3.100. 



THE RELATION BETWEEN THE ARITHMETIC, GEOMETRIC, AND HARMONIC MEANS 

The geometric mean of a set of positive numbers X x , X 2 ,..., X N is less than or equal to their 
arithmetic mean but is greater than or equal to their harmonic mean. In symbols, 

H < G < X (14) 

The equality signs hold only if all the numbers X { , X 2 , . . ■ , X N are identical. 

EXAMPLE 15. The set 2, 4, 8 has arithmetic mean 4.67, geometric mean 4, and harmonic mean 3.43. 



THE ROOT MEAN SQUARE 

The root mean square (RMS), or quadratic mean, of a set of numbers X x , X 2 , . . . , X N is sometimes 
denoted by and is defined by 



RMS = 




This type of average is frequently used in physical applications. 
EXAMPLE 16. The RMS of the set 1, 3, 4, 5, and 7 is 

/ 2 + 32 + ^ 2+ g±g=^ = 4.47 



QUARTILES, DECILES, AND PERCENTILES 

If a set of data is arranged in order of magnitude, the middle value (or arithmetic mean of the two 
middle values) that divides the set into two equal parts is the median. By extending this idea, we can 
think of those values which divide the set into four equal parts. These values, denoted by Q { , Q 2 , and Q 3 , 
are called the first, second, and third quartiles, respectively, the value Q 2 being equal to the median. 

Similarly, the values that divide the data into 10 equal parts are called deciles and are denoted by D x , 
D 2 , . . ., D 9 , while the values dividing the data into 100 equal parts are called percentiles and are denoted 
by Pi, P 2 , ■ ■ • , ^99. The fifth decile and the 50th percentile correspond to the median. The 25th and 75th 
percentiles correspond to the first and third quartiles, respectively. 

Collectively, quartiles, deciles, percentiles, and other values obtained by equal subdivisions of 
the data are called quantiles. For computations of these from grouped data, see Problems 3.44 to 3.46. 
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EXAMPLE 17. Use EXCEL to find Q u Q 2 , Q 3 , D 9 , and P 95 for the following sample of test scores: 
88 45 53 86 33 86 85 30 89 53 41 96 56 38 62 
71 51 86 68 29 28 47 33 37 25 36 33 94 73 46 
42 34 79 72 88 99 82 62 57 42 28 55 67 62 60 
96 61 57 75 93 34 75 53 32 28 73 51 69 91 35 

To find the first quartile, put the data in the first 60 rows of column A of the EXCEL worksheet. 
Then give the command =PERCENTILE(A1:A60,0.25). EXCEL returns the value 37.75. We find 
that 15 out of the 60 values are less than 37.75 or 25% of the scores are less than 37.75. Similarly, 
we find =PERCENTILE(A1:A60,0.5) gives 57, =PERCENTILE(A1:A60,0.75) gives 76, 
=PERCENTILE(A1:A60,0.9) gives 89.2, and =PERCENTILE(A1:A60,0.95) gives 94.1. EXCEL 
gives quartiles, deciles, and percentiles all expressed as percentiles. 

An algorithm that is often used to find quartiles, deciles, and percentiles by hand is described as 
follows. The data in example 17 is first sorted and the result is: 

test scores 

25 28 28 28 29 30 32 33 33 33 34 34 35 36 37 

38 41 42 42 45 46 47 51 51 53 53 53 55 56 57 

57 60 61 62 62 62 67 68 69 71 72 73 73 75 75 

79 82 85 86 86 86 88 88 89 91 93 94 96 96 99 

Suppose we wish to find the first quartile (25th percentile). Calculate i = np/ 100 = 60(25)/100 = 15. 
Since 15 is a whole number, go to the 15th and 16th numbers in the sorted array and average the 15th 
and 16th numbers. That is average 37 and 38 to get 37.5 as the first quartile (Qi = 37.5). To find the 93rd 
percentile, calculate np/ 100 = 60(93)/ 100 to get 55.8. Since this number is not a whole number, always 
round it up to get 56. The 56th number in the sorted array is 93 and P 93 = 93. The EXCEL command 
=PERCENTILE(A1:A60,0.93) gives 92.74. Note that EXCEL does not give the same values for the 
percentiles, but the values are close. As the data set becomes larger, the values tend to equal each other. 



SOFTWARE AND MEASURES OF CENTRAL TENDENCY 

All the software packages utilized in this book give the descriptive statistics discussed in this section. 
The output for all five packages is now given for the test scores in Example 17. 

EXCEL 

If the pull-down "Tools Data Analysis Descriptive Statistics" is given, the measures of central 
tendency median, mean, and mode as well as several measures of dispersion are obtained: 



Mean 


59.16667 


Standard Error 


2.867425 


Median 


57 


Mode 


28 


Standard Deviation 


22.21098 


Sample Variance 


493.3277 


Kurtosis 


-1.24413 


Skewness 


0.167175 


Range 


74 


Minimum 


25 


Maximum 


99 


Sum 


3550 


Count 


60 
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MINITAB 

If the pull-down "Stat => Basic Statistics Display Descriptive Statistics" is given, the following 
output is obtained: 

Descriptive Statistics: testscore 

Variable N N* Mean SE Mean St Dev Minimum Ql Median Q3 

testscore 60 59.17 2.87 22.21 25.00 37.25 57.00 78.00 

Variable Maximum 
testscore 99.00 

SPSS 

If the pull-down "Analyze =>• Descriptive Statistics =^ Descriptives" is given, the following output is 
obtained: 



Descriptive Statistics 





N 


Minimum 


Maximum 


Mean 


Std. Deviation 


Testscore 
Valid N (listwise) 


60 
60 


25.00 


99.00 


59.1667 


22.21098 



If the pull-down "Solutions =>■ Analysis =>■ Analyst" is given and the data are read in as a file, the 
pull-down "Statistics => Descriptive Summary Statistics" gives the following output: 

The MEANS Procedure 
Analysis Variable : Testscr 

Mean Std Dev N Minimum Maximum 

ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff 
59.1666667 22.2109811 60 25.0000000 99.0000000 

ff ff ff f ff ff ff ff f ff ff ff ff f ff ff ff ff f ff ff f ff ff ff ff f ff ff ff ff f ff ff f ff ff 



If the pull-down "Statistics Summary Statistics Descriptive Statistics" is given in the software 
package STATISTIX, the following output is obtained: 

Statistix 8 . 
Descriptive Staistics 



Mean 59.167 

SD 22.211 

Minimum 2 5.000 

1st Quarti 37.250 

3rdQuarti 78.000 

Maximum 99.00 
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Solved Problems 

SUMMATION NOTATION 

3.1 Write out the terms in each of the following indicated sums: 
(a) jr Xj (c) f> (e) Y.(Xj-a) 

(ft) J2(Yj-3) 2 (d) if A 

SOLUTION 

(a) X x + X 2 + X 3 + X 4 + X 5 + X 6 

(b) (F, - 3) 2 + (Y 2 - 3) 2 + (F 3 - 3) 2 + (Y 4 - 3) 2 

(c) a + a + a H h a = Na 

(d) f t X t +f 2 X 2 +f 3 X 3 +f 4 X 4 +f 5 X 5 

(e) (X, - a) + (X 2 -a) + (X, - a) = X, + X 2 + X 3 - 3a 

3.2 Express each of the following by using the summation notation: 
(a) X\ + x\ + x\ + ■ ■ ■ + Xj 

(ft) (X l + Y 1 ) + (X 2 +Y 2 ) + --- + (X S +Y 8 ) 

(c) f,X\ +f 2 Xl + ■■■ +f 20 X 3 2Q 

(d) a^i+a^ + a^^ h a N b N 

(e) AX, 7, +f 2 X 2 Y 2 +f 3 X 3 Y 3 +f 4 X 4 Y 4 

SOLUTION 

10 20 4 

(«) Y^xj (<•) X; (e) x / A. > 

(A) J2(X J+ Yj) {d) f^ajbj 

3.3 Prove that , (al} + ft 7 ; - cZ y ) = a 1} + ft J^U Y j ~ c T,f =] Z } , where a, ft, and c are any 
constants. 

SOLUTION 

{aXj + bYj- cZj) = (aXi +bY l -cZ l ) + {aX 2 + bY 2 - cZ 2 ) + ■■■ + (aX N + bY N - cZ N ) 

= (aX x + aX 2 + • • • + aX N ) + (bY { + bY 2 + • • • + bY N ) - {cZ x + cZ 2 + • • • + cZ N ) 
= a(X x + X 2 + ■ ■ ■ + X N ) + b{Y { + Y 2 + ■ ■ ■ + Y N ) - c{Z x +Z 2 + --- + Z N ) 

= a±X i + b±Y j -c±Z j 

7=1 7=1 7=1 

or briefly, J2( aX ' + hY ' ~ cZ ) = a J2 x + h J2 Y ' - cE z - 
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3.4 Two variables, X and Y, assume the values X x =2, X 2 = -5, X 3 =4, X 4 = -8 and 7, = -3, 
Y 2 = -8, Y 3 = 10, Y 4 = 6, respectively. Calculate (a) J2 X , (*) E F > (c) Y. XY > ( d ) 
(e) E Y 2 , (J) (E *)(E 10, is) E ^ and (A) J2(X + Y)(X - Y). 
SOLUTION 

Note that in each case the subscript j on X and F has been omitted and E is understood as E/=i ■ Thus 
E X, for example, is short for E/=i x j- 

(a) £X= (2) + (-5) + (4) + (-8) = 2 - 5 + 4 - 8 = -7 

(*) E Y = (-3) + (-8) + (10) + (6) = -3 - 8 + 10 + 6 = 5 

(c) Y.XY = (2)(-3) + (-5)(-8) + (4)(10) + (-8)(6) = -6 + 40 + 40 - 48 = 26 

(</) J2X 2 = (2) 2 + (-5) 2 + (4) 2 + (-8) 2 = 4 + 25 + 16 + 64 = 109 

(e) J] F 2 = (-3) 2 + (-8) 2 + (10) 2 + (6) 2 = 9 + 64 + 100 + 36 = 209 

00 (£*)(£ Y) = (-7)(5) = -35, using parts (a) and (ft). Note that F) / E^. 

fe) E^ 2 = (2)(-3) 2 + (-5)(-8) 2 + (4)(10) 2 + (-8)(6) 2 = -190 

(h) E + 7)(X - F) = E (^ 2 - Y 2 ) = E A" 2 - E Y 2 = 109 - 209 = -100, using parts (d) and (e). 



In a USA Today snapshot, it was reported that per capita state taxes collected in 2005 averaged 
$2189.84 across the United States. The break down is: Sales and receipts $1051.42, Income 
$875.23, Licenses $144.33, Others $80.49, and Property $38.36. Use EXCEL to show that the 
sum equals $2189.84. 

SOLUTION 

Note that the expression = 



m(Al:A5) is equivalent to y^l}. 



1051.42 
875.23 
144.33 
80.49 
38.36 
2189.83 



sales and receipts 

income 

licenses 

others 

property 

=sum(A1 :A5) 



THE ARITHMETIC MEAN 



3.6 The grades of a student on six examinations were 84, 91, 72, 68, 87, and 78. Find the arithmetic 
mean of the grades. 



_ E X _ 84 + 91 + 72 + 68 + 87 + 78 _ 



e uses the U 
ce there are 



n average synonymously with arithmetic it 
verages other than the arithmetic mean. 



i. Strictly speaking, however, 



3.7 Ten measurements of the diameter of a cylinder were recorded by a scientist as 3.88, 4.09, 3.92, 
3.97, 4.02, 3.95, 4.03, 3.92, 3.98, and 4.06 centimeters (cm). Find the arithmetic mean of the 
measurements. 
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_ E Z _ 3.88 + 4.09 + 3.92 + 3.97 + 4.02 + 3.95 + 4.03 + 3.92 + 3.98 + 4.06 _ 39.82 _ 



3.8 The following MINITAB output shows the time spent per week on line for 30 Internet users 
well as the mean of the 30 times. Would you say this average is typical of the 30 times? 

MTB > print cl 

Data Display 

time 

3 4 4 5 5 
6 6 6 7 7 
9 10 10 10 10 

MTB > mean cl 

Column Mean 

Mean of time = 10.400 

SOLUTION 

The mean 10.4 hours is not typical of the times. Note that 21 of the 30 times are in the single digits, but 
the mean is 10.4 hours. A great disadvantage of the mean is that it is strongly affected by extreme values. 



5 5 5 5 6 
7 7 7 8 8 
10 10 12 55 60 



3.9 Find the arithmetic mean of the numbers 5, 3, 6, 5, 4, 5, 2, 8, 6, 5, 4, 8, 3, 4, 5, 4, 8, 2, 5, and 4. 



SOLUTION 



First method 



? ^EX = 5 + 3 + 6+5 + 4 + 5 + 2 + 8 + 6 + 5 + 4 + 8 + 3 + 4 + 5 + 4 + 8 + 2 + 5 + 4 = 96 

N " " " 20 ~~ ~ ~~ ~ ~ ~ ~ 20 



Second method 

There are six 5*s, two 3*s, two 6*s, five 4*s, two 2's and three 8's. Thus 

v N : / V HfX _ (6)(5) + (2)(3) + (2)(6) + (5)(4) + (2)(2) + (3)(8) = 96 

£/ N 6 + 2 + 2 + 5 + 2 + 3 20 ' 



3.10 Out of 100 numbers, 20 were 4's, 40 were 5's, 30 were 6's and the remainder were 7's. Find the 
arithmetic mean of the numbers. 



N :/V n :.'V (20) (4) + (40)(5) + (30)(6) + (10)(7) = 530 = 
J2.f N 100 100 



3.11 A student's final grades in mathematics, physics, English and hygiene are, respectively, 82, 86, 90, 
and 70. If the respective credits received for these courses are 3, 5, 3, and 1, determine an 
appropriate average grade. 
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SOLUTION 

We use a weighted arithmetic mean, the weights associated with each grade being taken as the number 
of credits received. Thus 

Y E'rf (3)(82) + (5)(86) + (3)(90) + (l)(70) Rs 
X = T^ = 3 + 5 + 3+1 =85 



3.12 In a company having 80 employees, 60 earn $10.00 per hour and 20 earn $13.00 per hour. 

(a) Determine the mean earnings per hour. 

(b) Would the answer in part (a) be the same if the 60 employees earn a mean hourly wage of 
$10.00 per hour? Prove your answer. 

(c) Do you believe the mean hourly wage to be typical? 

SOLUTION 

(a) 

X = EZ* = (60)($10.00) + (20)($13.00) = ?J 
N 60 + 20 

(b) Yes, the result is the same. To prove this, suppose that f { numbers have mean ni\ and that f 2 numbers 
have mean m 2 . We must show that the mean of all the numbers is 

v _ h m \ +fim 2 

A+f2 

Let the f x numbers add up to M x and the f 2 numbers add up to M 2 . Then by definition of the 
arithmetic mean, 

Mi , M 2 

fit] = — and m 2 = — 
Jx h 

or Mi =j\ni\ and M 2 =f 2 m 2 . Since all (J\ +/ 2 ) numbers add up to (Mi + M 2 ), the arithmetic mean of 
all numbers is 

= Mi+M 2 fjmi +f 2 m 2 
fx +fi fx +fi 

as required. The result is easily extended. 

(c) We can say that $10.75 is a "typical" hourly wage in the sense that most of the employees earn $10.00, 
which is not too far from $10.75 per hour. It must be remembered that, whenever we summarize 
numerical data into a single number (as is true in an average), we are likely to make some error. 
Certainly, however, the result is not as misleading as that in Problem 3.8. 

Actually, to be on safe ground, some estimate of the "spread," or "variation," of the data about the 
mean (or other average) should be given. This is called the dispersion of the data. Various measures of 
dispersion are given in Chapter 4. 

3.13 Four groups of students, consisting of 15, 20, 10, and 18 individuals, reported mean weights of 
162, 148, 153, and 140 pounds (lb), respectively. Find the mean weight of all the students. 

SOLUTION 

y _ £7^ _ (15)(162) + (20)(148) + (10)(153) + (18)(140) 
X ~YJ~ 15 + 20+10+18 - 15 ° lb 



3.14 If the mean annual incomes of agricultural and nonagricultural workers are $25,000 and $35,000, 
respectively, would the mean annual income of both groups together be $30,000? 
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SOLUTION 

It would be $30,000 only if the numbers of agricultural and nonagricultural workers were the same. 
To determine the true mean annual income, we would have to know the relative numbers of workers in 
each group. Suppose that 10% of all workers are agricultural workers. Then the mean would be 
(0.10) (25, 000) + (0.90) (35, 000) = $34,000. If there were equal numbers of both types of workers, the 
mean would be (0.50)(25, 000) + (0.50)(35, 000) = $30,000. 

3.15 Use the frequency distribution of heights in Table 2.1 to find the mean height of the 100 male 
students at XYZ university. 

SOLUTION 

The work is outlined in Table 3.1. Nolo thai all students having heights 60 to 62 inches (in), 63 to 65 in, 
etc., are considered as having heights 61 in, 64 in, etc. The problem then reduces to finding the mean height 
of 100 students if 5 students have height 61 in, 18 have height 64 in, etc. 

The computations involved can become tedious, especially for cases in which the numbers are large and 
many classes are present. Short techniques are available for lessening the labor in such cases; for example, 
see Problems 3.20 and 3.22. 



Table 3.1 



Height (in) 


Class Mark (X) 


Frequency (/) 


fx 


60-62 


61 


5 


305 


63-65 


64 


18 


1152 


66-68 


67 


42 


2814 


69-71 


70 


27 


1890 


72-74 


73 


8 


584 




N = £ / = 100 


EfX = 6745 



y £A Y,fX 6745 

I= E7 = ^ = W = 67 - 45 m 

PROPERTIES OF THE ARITHMETIC MEAN 

3.16 Prove that the sum of the deviations of I b I 2 ,..., X N from their mean X is equal to zero. 
SOLUTION 

Let d l =X x -X,d 2 = X 1 -X,...,d N = X N - X be the deviations of X u X 2 , . . . , X N from their mean 
X. Then 

Sum of deviations = £ 4 = £ {Xj - X) = £ X } - NX 

= Zx j -n(^i s )=zxj-ZXj = o 

where we have used £ in place of J2jLi ■ We could, if desired, have omitted the subscript j in Xj, provided 
that it is understood. 

3.17 lfZ 1 =X l + Y u Z 2 = X 2 +Y 2 ,...,Z N = X N + Y N , prove that Z = X + Y. 
SOLUTION 

By definition, 

y_^X £ y 7_£z 

N N N 
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Thus 2 = ZZ = Z(X+Y) = ZX + ZY^ ZY = 

N N N N N 

where the subscripts j on X, Y, and Z have been omitted and where J2 means E>lr 

3.18 (a) If N numbers X x , X 2 ,...,X N have deviations from any number A given by d x — X x - A, 
d 2 = X 2 - A, . . . , d N = X N - A, respectively, prove that 



£4 



(b) In case X x , X 2 ,...,X K have respective frequencies f x , f 2 , . . . ,f K and d x - 
dt: = X K - A, show that the result in part (a) is replaced with 



(a) First method 

Since d t = X t - A and Xj = A + d h we have 

y = Ej} = EU + 4) = Ejj+E 4 = A^j+Ej = E 4 

TV TV iV AT JV 

where we have used J2 m place of J]^, for brevity. 
Second method 

We have d = X - A, or X = A + d, omitting the subscripts on d and X. Thus, by Problem 3.17, 

X = A + d = A + ^ 
N 

since the mean of a number of constants all equal to A is A. 

itfj x i 

x= ^r = ^^= iv = — n ■ ' = — * — - 

_ ANJrYJjd l _ 

JV " 

Note that formally the result is obtained from part (a) by replacing ^ with fj dj and summing from j = 1 
to instead of from j = 1 to AT. The result is equivalent to X = A + d. where d = (52fd)/N. 



THE ARITHMETIC MEAN COMPUTED FROM GROUPED DATA 



3.19 Use the method of Problem 3.18(a) to find the arithmetic mean of the numbers 5, 8, 11, 9, 12, 6, 
14, and 10, choosing as the "guessed mean" A the values (a) 9 and (b) 20. 
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SOLUTION 

(a) The deviations of the given numbers from 9 are -4, — 1, 2, 0, 3, -3, 5, and 1, and the sum of the 
deviations is E ^ = ~ 4 -1 + 2 + + 3- 3 + 5 + 1= 3. Thus 

X = ,4+=^ = 9 + | = 9.375 
N 8 

(b) The deviations of the given numbers from 20 are —15, —12, —9, —11, —8, —14, —6, and —10, and 
Y.d= -85. Thus 

^ + ^ = 20 + ^ = 9.375 



3.20 Use the method of Problem 3.18(6) to find the arithmetic mean of the heights of the 100 male 
students at XYZ University (see Problem 3.15). 



SOLUTION 

The work may be arranged as in Tabic 3.2. We take (he guessed mean A to be the class mark 67 (which 
has the largest frequency), although any class mark can be used for A. Note that the computations are 
simpler than those in Problem 3.15. To shorten the labor even more, we can proceed as in Problem 3.22, 
where use is made of the fact that the deviations (column 2 of Table 3.2) are all integer multiples of the 
class-interval size. 



Table 3.2 



Class Mark (X) 


Deviation 

d = X -A 


Frequency (/) 


fd 


61 


-6 


5 


-30 


64 


-3 


18 


-54 


A — >67 





42 





70 


3 


27 


81 


73 


6 


8 


48 




N = E / = ioo 


E fd = 45 



3.21 Let dj = Xj — A denote the deviations of any class mark Xj in a frequency distribution from a 
given class mark A. Show that if all class intervals have equal size c, then (a) the deviations are all 
multiples of c (i.e., dj — cuj, where Uj = 0, ±1, ±2,...) and (b) the arithmetic mean can be 
computed from the formula 

' — (¥> 

SOLUTION 

(a) The result is illustrated in Table 3.2 of Problem 3.20, where it is observed that the deviations in column 
2 are all multiples of the class-interval size c = 3 in. 

To see that the result is true in general, note that if X u X 2 , X 3 , . . . are successive class marks, their 
common difference will for this case be equal to c, so that X 2 = X x + c, X 3 = X x + 2c, and in general 
Xj = X\ + ( j — \)c. Then any two class marks X p and X q , for example, will differ by 

X p -X q = [X l + {p- l)c] - [X, +{q- l)c] = (p-q)c 

which is a multiple of c. 
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(b) By part (a), the deviations of all the class marks from any given one are multiples of c (i.e., d t = cuj). 
Then, using Problem 3.18(6), we have 

N N N \ N J 

Note that this is equivalent to the result X = A + cu, which can be obtained from X = A + d by 
placing d = cu and observing that d = cu (see Problem 3.18). 

3.22 Use the result of Problem 3.21(6) to find the mean height of the 100 male students at XYZ 
University (see Problem 3.20). 



The work may be arranged as in Table 3.3. The method is called the coding method and should be 
employed whenever possible. 



X 




/ 


fu 


61 


-2 


5 


-10 


64 


-1 


18 


-18 


67 





42 





70 


1 


27 


27 


73 


2 




16 




N = 100 


£/"= 15 



= 67+ 7^ « = 67.45 ir 



3.23 Compute the mean weekly wage of the 65 employees at the P&R Company from the frequency 
distribution in Table 2.5, using (a) the long method and (b) the coding method. 

SOLUTION 

Tables 3.4 and 3.5 show the solutions to (a) and (b), respectively. 



X 


/ 


fx 




X 




/ 


/" 


$255.00 




$2040.00 




$255.00 


-2 


8 


-16 


265.00 


10 


2650.00 




265.00 


-1 


10 


-10 


275.00 


16 


4400.00 


A - 


-> 275.00 





16 





285.00 


14 


3990.00 




285.00 


1 


14 


14 


295.00 


10 


2950.00 




295.00 




10 


20 


305.00 


5 


1525.00 




305.00 


3 


5 


15 


315.00 




630.00 




315.00 


4 


2 


8 




TV = 65 


Y, fX = $18,185.00 






N= 65 


E/"=31 
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It might be supposed that error would be introduced into these tables since the class marks are actually 
$254,995, $264,995, etc., instead of $255.00, $265.00, etc. If in Table 3.4 these true class marks are used 
instead, then X turns out to be $279.76 instead of $279.77, and the difference is negligible. 



_ J2f x _ $18,185.00 _ 



c = $275.00 + — ($10.00) = $279.77 



3.24 Using Table 2.9(d), find the mean wage of the 70 employees at the P&R Company. 
SOLUTION 

In this case the class intervals do not have equal size and we must use the long method, as shown ii 
Table 3.6. 



X 


/ 


fx 


$255.00 




$2040.00 


265.00 


10 


2650.00 


275.00 


16 


4400.00 


285.00 


15 


4275.00 


295.00 


10 


2950.00 


310.00 




2480.00 


350.00 


3 


1050.00 




iV = 70 


Y, fX = $19,845.00 



_J2fX_ $19,845.00 _ 



THE MEDIAN 

3.25 The following MINITAB output shows the time spent per week searching on line for 30 Internet 
users as well as the median of the 30 times. Verify the median. Would you say this average is 
typical of the 30 times? Compare your results with those found in Problem 3.8. 

MTB > print cl 

Data Display 

time 

3 4 4 5 
6 6 6 7 
9 10 10 10 

MTB > median cl 

Column Median 

Median of time = 7 . 0000 

SOLUTION 

Note that the two middle values are both 7 and the mean of the two middle values is 7. The mean was 
found to be 10.4 hours in Problem 3.8. The median is more typical of the times than the mean. 



5 5 5 5 5 6 
7 7 7 7 8 8 
10 10 10 12 55 60 
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3.26 The number of ATM transactions per day were recorded at 1 5 locations in a large city. The data 
were: 35, 49, 225, 50, 30, 65, 40, 55, 52, 76, 48, 325, 47, 32, and 60. Find (a) the median number of 
transactions and (b) the mean number of transactions. 

SOLUTION 

(a) Arranged in order, the data are: 30, 32, 35, 40, 47, 48, 49, 50, 52, 55, 60, 65, 76, 225, and 325. 
Since there is an odd number of items, there is only one middle value, 50, which is the required 
median. 

(b) The sum of the 15 values is 1189. The mean is 1189/15 = 79.267. 

Note that the median is not affected by the two extreme values 225 and 325, while the mean is 
affected by it. In this case, the median gives a better indication of the average number of daily ATM 



3.27 If (a) 85 and (b) 150 numbers are arranged in an array, how would you find the median of the 
numbers? 

SOLUTION 

(a) Since there are 85 items, an odd number, there is only one middle value with 42 numbers below and 42 
numbers above it. Thus the median is the 43rd number in the array. 

(b) Since there are 1 50 items, an even number, there are two middle values with 74 numbers below them 
and 74 numbers above them. The two middle values are the 75th and 76th numbers in the array, and 
their arithmetic mean is the required median. 

3.28 From Problem 2.8, find the median weight of the 40 male college students at State University by 
using (a) the frequency distribution of Table 2.7 (reproduced here as Table 3.7) and (b) the 
original data. 

SOLUTION 

(a) First method (using interpolation) 

The weights in the frequency distribution of Table 3.7 are assumed to be continuously distributed. 
In such case the median is that weight for which half the total frequency (40/2 = 20) lies above it and 
half lies below it. 



Table 3.7 



Weight (lb) 


Frequency 


118-126 


3 


127-135 


5 


136-144 


9 


145-153 


12 


154-162 


5 


163-171 


4 


172-180 






Total 40 



Now the sum of the first three class frequencies is 3 + 5 + 9 = 17. Thus to give the desired 20, 
we require three more of the 12 cases in the fourth class. Since the fourth class interval, 145-153, 
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actually corresponds to weights 144.5 to 153.5, the median must lie 3/12 of the way between 144.5 and 
153.5; that is, the median is 

144.5 + ^(153.5- 144.5) = 144.5 + ^(9) = 146.8 1b 

Second method (using formula) 

Since the sum of the first three and first four class frequencies are 3 + 5 + 9=17 and 
3 + 5 + 9+ 12 = 29, respectively, it is clear that the median lies in the fourth class, which is therefore 
the median class. Then 

L] = lower class boundary of median class = 144.5 
N = number of items in the data = 40 
(E f)\ = sum of all classes lower than the median class = 3 + 5 + 9=17 
/median = frequency of median class =12 
c = size of median class interval = 9 

and thus 

Me dl an = L, + ( N/2 J^ f)l ) c = 144.5 + ") (9) = 146.8 lb 

(b) Arranged in an array, the original weights are 

119, 125, 126, 128, 132, 135, 135, 135, 136, 138, 138, 140, 140, 142, 142, 144, 144, 145, 145, 146, 
146, 147, 147, 148, 149, 150, 150, 152, 153, 154, 156, 157, 158, 161, 163, 164, 165, 168, 173, 176 
The median is the arithmetic mean of the 20th and 21st weights in this array and is equal to 1461b. 



3.29 Figure 3-3 gives the stem-and-leaf display for the number of 2005 alcohol-related traffic deaths 
for the 50 states and Washington D.C. 
Stem-and-Leaf Display: Deaths 

Stem-and-Leaf Display: Deaths 

Stem-and-leaf of Deaths N = 51 
Leaf Unit = 10 



14 





2233455666 


23 


1 


122255778 


(7) 


2 


0334689 


21 


3 


124679 


15 


4 


22669 


10 


5 


012448 


4 


6 


3 


3 


7 




3 


8 




3 


9 




3 


10 




3 


11 




3 


12 




3 


13 




3 


14 


7 


2 


15 


6 


1 


16 




1 


17 





Fig. 3-3 MIN1TAB stem-and-leaf display for 2005 alcohol-related traffic deaths. 
Find the mean, median, and mode of the alcohol-related deaths in Fig. 3-3. 
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SOLUTION 

The number of deaths range from 20 to 1710. The distribution is bimodal. The two modes are 
60 and 120, both of which occur three times. 

The class (7) 2 0334689 is the median class. That is, the median is 
contained in this class. The median is the middle of the data or the 26th value in the ordered 
array. The 24th value is 200, the 25th value is 230, and the 26th value is also 230. Therefore 
the median is 230. 

The sum of the 51 values is 16,660 and the mean is 16,660/51 = 326.67. 

3.30 Find the median wage of the 65 employees at the P&R Company (see Problem 2.3). 
SOLUTION 

Here N = 65 and N/2 = 32.5. Since the sum of the first two and first three class frequencies 
are 8+10=18 and 8+10+16 = 34, respectively, the median class is the third class. Using the 
formula, 

Median = L x + ^"^^' j c = $269,995 + ( ^ 2 ' 5 ~ ' 8 j ($10.00) = $279.06 

THE MODE 

3.31 Find the mean, median, and mode for the sets (a) 3, 5, 2, 6, 5, 9, 5, 2, 8, 6 and (b) 51.6, 48.7, 50.3, 
49.5, 48.9. 

SOLUTION 

(a) Arranged in an array, the numbers are 2, 2, 3, 5, 5, 5, 6, 6, 8, and 9. 

Mean = ^(2 + 2 + 3 + 5 + 5 + 5 + 6 + 6 + 8 + 9) = 5.1 
Median = arithmetic mean of two middle numbers = i(5 + 5) = 5 
Mode = number occurring most frequently = 5 

(b) Arranged in an array, the numbers are 48.7, 48.9, 49.5, 50.3, and 51.6. 

Mean = i (48.7 + 48.9 + 49.5 + 50.3 + 51.6) = 49.8 
Median = middle number = 49.5 
Mode = number occurring most frequently (nonexistent here) 

3.32 Suppose you desired to find the mode for the data in Problem 3.29. You could use frequencies 
procedure of SAS to obtain the following output. By considering the output from FREQ 
Procedure (Fig. 3-4), what are the modes of the alcohol-linked deaths? 
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The FREQ Procedure 
Deaths 









Cumulative 


Cumulative 


deaths 


Frequency 


Percent 


Frequency 


Percent 


/////////////////////////////////////////////////////////// 


20 










30 


2 


3.92 


4 


7.84 












50 










60 


3 


5.88 


10 


19.61 












80 


2 


3.92 


13 


25.49 


90 




1.96 


14 


27.45 


110 




1.96 


15 


29.41 


120 


3 


5.88 


18 


35.29 


150 


2 


3.92 


20 


39.22 


170 










180 




1.96 


23 


45.10 


200 


1 


1.96 


24 


47.06 


230 




3.92 


26 


50.98 






1.96 




52.94 


260 




1.96 


28 


54.90 


280 




1.96 


29 


56.86 


290 




1.96 


30 


58.82 


310 


1 


1.96 


31 


60.78 












340 


1 


1.96 


33 


64.71 












370 




1.96 


35 


68.63 


390 




1.96 


36 


70.59 


420 




3.92 


38 


74.51 


460 




3.92 


40 


78.43 


490 




1.96 


41 


80.39 


500 




1.96 


42 


82.35 


510 




1.96 


43 


84.31 


520 




1.96 


44 


86.27 


540 




3.92 


46 


90.20 


580 




1.96 


47 


92.16 


630 




1.96 


48 


94.12 


1470 




1.96 




96.08 


1560 




1.96 


50 


98.04 


1710 




1.96 


51 


100.00 



Fig. 3-4 SAS output for FREQ procedure and the alcohol-related deaths. 



SOLUTION 

The data is bimodal and the modes are 60 and 120. This is seen by considering the SAS 
output and noting that both 60 and 120 have a frequency of 3 which is greater than any other of 
the values. 

3.33 Some statistical software packages have a mode routine built in but do not return all modes in the 
case that the data is multimodal. Consider the output from SPSS in Fig. 3-5. 



What does SPSS do when asked to find modes? 
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Deaths 





Frequency 


Percent 


Valid Percent 


Cumulative 
Percent 


Valid 20.00 


2 


3.9 


3.9 


3.9 


30.00 


2 


3.9 


3.9 


7.8 


40.00 


1 


2.0 


2.0 


9.8 


50.00 


2 


3.9 


3.9 


13.7 


60.00 


3 


5.9 


5.9 


19.6 


70.00 


1 


2.0 


2.0 


21.6 


80.00 


2 


3.9 


3.9 


25.5 


90.00 


1 


2.0 


2.0 


27.5 


110.00 


1 


2.0 


2.0 


29.4 


120.00 


3 


5.9 


5.9 


35.3 


150.00 


2 


3.9 


3.9 


39.2 


170.00 


2 


3.9 


3.9 


43.1 


180.00 


1 


2.0 


2.0 


45.1 


200.00 


1 


2.0 


2.0 


47.1 


230.00 




3.9 


3.9 


51.0 


240.00 


1 


2.0 


2.0 


52.9 


260.00 


1 


2.0 


2.0 


54.9 


280.00 


1 


2.0 


2.0 


56.9 


290.00 


1 


2.0 


2.0 


58.8 


310.00 


1 


2.0 


2.0 


60.8 


320.00 


1 


2.0 


2.0 


62.7 


340.00 


1 


2.0 


2.0 


64.7 


360.00 


1 


2.0 


2.0 


66.7 


370.00 


1 


2.0 


2.0 


68.6 


390.00 


1 


2.0 


2.0 


70.6 


420.00 




3.9 


3.9 


74.5 


460.00 




3.9 


3.9 


78.4 


490.00 


1 


2.0 


2.0 


80.4 


500.00 


1 


2.0 


2.0 


82.4 


510.00 


1 


2.0 


2.0 


84.3 


520.00 




2.0 


2.0 


86.3 


540.00 




3.9 


3.9 


90.2 


580.00 




2.0 


2.0 


92.2 


630.00 




2.0 


2.0 


94.1 


1470.00 




2.0 


2.0 


96.1 


1560.00 




2.0 


2.0 


98.0 


1710.00 




2.0 


2.0 


100.0 


Total 


51 


100.0 


100.0 





Statistics 

Deaths 



N Valid 


51 


Missing 







60.00 a 





Fig. 3-5 SPSS output for the alcohol-related deaths. 
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SOLUTION 

SPSS prints out the smallest mode. However, it is possible to look at the frequency distribution and find 
all modes just as in SAS (see the output above). 

THE EMPIRICAL RELATION BETWEEN THE MEAN, MEDIAN, AND MODE 

3.34 (a) Use the empirical formula mean — mode = 3 (mean — median) to find the modal wage of the 

65 employees at the P&R Company. 
(b) Compare your result with the mode obtained in Problem 3.33. 
SOLUTION 

(a) From Problems 3.23 and 3.30 we have mean = $279.77 and median = $279.06. Thus 

Mode = mean - 3(mean - median) = $279.77 - 3($279.77 - $279.06) = $277.64 

(b) From Problem 3.33 the modal wage is $277.50, so there is good agreement with the empirical result in 
this case. 

THE GEOMETRIC MEAN 

3.35 Find (a) the geometric mean and (b) the arithmetic mean of the numbers 3, 5, 6, 6, 7, 10, and 12. 
Assume that the numbers are exact. 

SOLUTION 

(a) Geometric mean = G = ^/(3)(5)(6)(6)(7)(10)(12) = ^453,600. Using common logarithms, logG = 
i log453,600 = i (5.6567) = 0.8081, and G = 6.43 (to the nearest hundredth). Alternatively, a calcula- 
tor can be used. 

Another method 

log G = i (log 3 + log 5 + log 6 + log 6 + log 7 + log 10 + log 12) 

= i (0.4771 + 0.6990 + 0.7782 + 0.7782 + 0.8451 + 1.0000 + 1.0792) 
= 0.8081 
and G = 6.43 

(b) Arithmetic mean = Z = | (3 + 5 + 6 + 6 + 7+ 10+ 12) = 7. This illustrates the fact that the geometric 
mean of a set of unequal positive numbers is less than the arithmetic mean. 

3.36 The numbers X l ,X 2 ,...,X K occur with frequencies / 1; / 2 , . . . , f K , where f x + f 2 H h f K — N is 

the total frequency. 

(a) Find the geometric mean G of the numbers. 

(b) Derive an expression for log G. 

(c) How can the results be used to find the geometric mean for data grouped into a frequency 
distribution? 

SOLUTION 

(a) G = y/ XtXf-Xt X 2 X 2 ---X 2 ---X k X k ---X~k = \Jx{<xj ■ ■ ■ X f £ 

/, times h t^es f K times 

where N = J2f- This is sometimes called the weighted geometric mean. 
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(b) log G = i log {X* x£---X f /)=^ (/! log X, + / 2 log X 2 + ■ ■ ■ + f K log X K ) 

where we assume that all the numbers are positive; otherwise, the logarithms are not defined. 

Note that the logarithm of the geometric mean of a set of positive numbers is the arithmetic mean of 
the logarithms of the numbers. 

(c) The result can be used to find the geometric mean for grouped data by taking X x , X 2 , . . . ,X K as class 
marks and f\, f 2 , ■ ■ ■ , fx as the corresponding class frequencies. 

3.37 During one year the ratio of milk prices per quart to bread prices per loaf was 3.00, whereas 
during the next year the ratio was 2.00. 

(a) Find the arithmetic mean of these ratios for the 2-year period. 

(b) Find the arithmetic mean of the ratios of bread prices to milk prices for the 2-year period. 

(c) Discuss the advisability of using the arithmetic mean for averaging ratios. 

(d) Discuss the suitability of the geometric mean for averaging ratios. 

SOLUTION 

(a) Mean ratio of milk to bread prices = \ (3.00 + 2.00) = 2.50. 

(b) Since the ratio of milk to bread prices for the first year is 3.00, the ratio of bread to milk prices 
is 1/3.00 = 0.333. Similarly, the ratio of bread to milk prices for the second year is 1/2.00 = 0.500. 
Thus 

Mean ratio of bread to milk prices = \ (0.333 + 0.500) = 0.417 

We would expect the mean ratio of milk to bread prices to be the reciprocal of the mean ratio of bread 
to milk prices if the mean is an appropriate average. However, 1 /0.417 = 2.40 ^ 2.50. This shows that 
the arithmetic mean is a poor average to use for ratios. 

Geometric mean of ratios of milk to bread prices = \/(3.00)(2.00) = a/6.00 
Geometric mean of ratios of bread to milk prices = ^(0.333) (0.500) = V0.0167 = 1/V6M 

Since these averages are reciprocals, our conclusion is that the geometric mean is more suitable than the 
arithmetic mean for averaging ratios for this type of problem. 

3.38 The bacterial count in a certain culture increased from 1000 to 4000 in 3 days. What was the 
average percentage increase per day? 

SOLUTION 

Since an increase from 1000 to 4000 is a 300% increase, one might be led to conclude that the average 
percentage increase per day would be 300%/3 = 100%. This, however, would imply that during the first day 
the count went from 1000 to 2000, during the second day from 2000 to 4000, and during the third day from 
4000 to 8000, which is contrary to the facts. 

To determine this average percentage increase, let us denote it by r. Then 
Total bacterial count after 1 day = 1000 + 1000c = 1000(1 + r) 
Total bacterial count after 2 days = 1000(1 + r) + 1000(1 + r)r = 1000(1 + rf 
Total bacterial count after 3 days = 1000(1 + rf + 1000(1 + rfr = 1000(1 + rf 

This last expression must equal 4000. Thus 1000(1 + rf = 4000, (1 + rf = 4, 1 + r = \/4, and r = \/4 - 1 = 
1.587 - 1 = 0.587, so that r = 58.7%. 
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In general, if we start with a quantity P and increase it at a constant rate r per unit of time, we will have 
after n units of time the amount 

A = P(l+r) n 

This is called the compound-interest formula (see Problems 3.94 and 3.95). 
THE HARMONIC MEAN 

3.39 Find the harmonic mean H of the numbers 3, 5, 6, 6, 7, 10, and 12. 
SOLUTION 



1 - 1 V 1 - 1 ( 



1/11111 1 1 \ 

6 + 6 + 7+10 + I2J 



2940 

and * = f£=5.87 

It is often convenient to express the fractions in decimal form first. Thus 

^ = i (0.3333 + 0.2000 + 0.1667 + 0.1667 + 0.1429 + 0.1000 + 0.0833) 

_ 1.1929 

7 

and ^nkr 5 - 87 

Comparison with Problem 3.35 illustrates the fact that the harmonic mean of several 
positive numbers not all equal is less than their geometric mean, which is in turn less than their 



3.40 During four successive years, a home owner purchased oil for her furnace at respective costs of 
$0.80, $0.90, $1.05, and $1.25 per gallon (gal). What was the average cost of oil over the 4-year 
period? 



Case 1 

Suppose that the home owner purchased the same quantity of oil each year, say 1000 gal. Then 

total cost $800 + $900 + $1050 + $1250 nn . , 

Avci aire cost : - 4 1 .00 a;i I 

total quantity purchased 4000 gal 

This is the same as the arithmetic mean of the costs per gallon; that is, \ ($0.80 + $0.90 + $ 1 .05 + $ 1 .25) = 
1 .00/gal. This result would be the same even if x gallons were used each year. 

Case 2 

Suppose that the home owner spends the same amount of money each year, say $1000. Then 

total cost $4000 
Average cost = — — ^ = (1250+ + 95 2 + 800)g a l = $ °- 975/gal 
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This is the same as the harmonic mean of the costs per gallon: 

J_ _1_ 4 J_ X = 0-975 

0.80 + 0.90 + 1.05 + 1.25 
This result would be the same even if y dollars were spent each year. 

Both averaging processes are correct, each average being computed under different prevailing 
conditions. 

It should be noted that in case the number of gallons used changes from one year to another instead of 
remaining the same, the ordinary arithmetic mean of Case 1 is replaced by a weighted arithmetic mean. 
Similarly, if the total amount spent changes from one year to another, the ordinary harmonic mean of Case 2 
is replaced by a weighted harmonic mean. 



3.41 A car travels 25 miles at 25 miles per hour (mi/h), 25 miles at 50 mph, and 25 miles at 75 mph. 
Find the arithmetic mean of the three velocities and the harmonic mean of the three velocities. 
Which is correct? 



The average velocity is equal to the distance traveled divided by the total time and is equal to the 
following: 

15 { p =40.9 mi/h 
( 1+ 2 + 3j 
The arithmetic mean of the three velocities is: 

25 + 50 + 75 cn . „ 



The harmonic mean is found as follows: 

H~ N^X~ 3 \25 + 50 H 
The harmonic mean is the correct measure of the average velocity. 

THE ROOT MEAN SQUARE, OR QUADRATIC MEAN 

3.42 Find the quadratic mean of the numbers 3, 5, 6, 6, 7, 10, and 12. 



SOLUTION 

Quadratic mean = RMS = f ± * ± I ± * + ± l * ± ^ = ^57 = 7.55 



3.43 Prove that the quadratic mean of two positive unequal numbers, a and b, is greater than their 
geometric mean. 

SOLUTION 

We are required to show that \J\(fl 2 + b 2 ) > Vab. If this is true, then by squaring both sides, 
\ (a 2 + b 2 ) > ab, so that a 2 + b 2 > lab, a 2 - lab + b 2 > 0, or (a - b f > 0. But this last inequality is true, 
since the square of any real number not equal to zero must be positive. 
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The proof consists in establishing the reversal of the above steps. Thus starting with ( a - b) 2 > , which 
we know to be true, we can show that a 2 + b 2 > lab, \ (a 2 + b 2 ) > ab, and finally \J\{a 2 + b 2 ) > \fab~, 
as required. 

Note that ^\(a 2 + b 2 ) = Jab if and only if a = b. 



QUARTILES, DECILES, AND PERCENTILES 

3.44 Find (a) the quartiles Q x , Q 2 , and Q 3 and (b) the deciles D { , D 2 , ...,D 9 for the wages of the 
65 employees at the P&R Company (see Problem 2.3). 



(a) The first quartile Q t is that wage obtained by counting N/4 = 65/4 = 16.25 of the cases, beginning with 
the first (lowest) class. Since the first class contains 8 cases, we must take 8.25 (16.25 - 8) of the 10 cases 
from the second class. Using the method of linear interpolation, we have 

g, = $259,995 + ^ ($10.00) = $268.25 

The second quartile Q 2 is obtained by counting the first 2N/4 = N/2 = 65/2 = 32.5 of the cases. 
Since the first two classes comprise 18 cases, we must take 32.5 - 18 = 14.5 of the 16 cases from the 
third class; thus 

Q 2 = $269,995 + ^ ($10.00) = $279.06 

Note that Q 2 is actually the median. 

The third quartile g 3 is obtained by counting the first 3iV/4 = | (65) — 48.75 of the cases. 
Since the first four classes comprise 48 cases, we must take 48.75 - 48 = 0.75 of the 10 cases from the 
fifth class; thus 



Hence 25% of the employees earn $268.25 or less, 50% earn $279.06 or less, and 75% earn $290.75 

(b) The first, second,..., ninth deciles are obtained by counting TV/10, 2JV/10, . . . , 97V/10 of the cases, 
beginning with the first (lowest) class. Thus 

D x = $249,995 + ^ ($10.00) = $258.12 D 6 = $279,995 + ^ ($10.00) = $283.57 

D 2 = $259.995 + ^($10.00) = $265.00 D 7 = $279,995 + ^ ($10.00) = $288.21 

D 3 = $269,995 + ^ ($10.00) = $270.94 D H = $289,995 + ^ ($10.00) = $294.00 

D 4 = $269.995 + ^($10.00) = $275.00 D 9 = $299,995 + ($10.00) = $301.00 

D 5 = $269,995 + ^ ($10.00) = $279.06 

Hence 10% of the employees earn $258.12 or less, 20% earn $265.00 or less, . . . , 90% earn $301.00 
or less. 

Note that the fifth decile is the median. The second, fourth, sixth, and eighth deciles, which divide 
the distribution into five equal parts and which are called quintiles, are sometimes used in practice. 
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3.45 Determine (a) the 35th and (b) the 60th percentiles for the distribution in Problem 3.44. 
SOLUTION 

(a) The 35th percentile, denoted by P 35 , is obtained by counting the first 35iV/100 = 35(65)/100 = 22.75 of 
the cases, beginning with the first (lowest) class. Then, as in Problem 3.44, 

P 35 = $269,995 + ^ ($10.00) = $272.97 
This means that 35% of the employees earn $272.97 or less. 

(b) The 60th percentile is P 60 = $279,995 + -^($10.00) = $283.57. Note that this is the same as the sixth 
decile or third quintile. 

3.46 The following EXCEL worksheet is contained in A1:D26. It contains the per capita income for 
the 50 states. Give the EXCEL commands to find Qi, Q2, Q3, and P 95 . Also, give the states that 
are on both sides those quartiles or percentiles. 



State 


Per capita income 


State 


Per capita income 


Wyoming 


36778 


Pennsylvania 


34897 


Montana 


29387 


Wisconsin 


33565 


North Dakota 


31395 


Massachusetts 


44289 


New Mexico 


27664 


Missouri 


31899 


West Virginia 


27215 


Idaho 


28158 


Rhode Island 


36153 


Kentucky 


28513 


Virginia 


38390 


Minnesota 


37373 


South Dakota 


31614 


Florida 


33219 


Alabama 


29136 


South Carolina 


28352 


Arkansas 


26874 


New York 


40507 


Maryland 


41760 


Indiana 


31276 


Iowa 


32315 


Connecticut 


47819 


Nebraska 


33616 


Ohio 


32478 


Hawaii 


34539 


New Hampshire 


38408 


Mississippi 


25318 


Texas 


32462 


Vermont 


33327 


Oregon 


32103 


Maine 


31252 


New Jersey 


43771 


Oklahoma 


29330 


California 


37036 


Delaware 


37065 


Colorado 


37946 


Alaska 


35612 


North Carolina 


30553 


Tennessee 


31107 


Illinois 


36120 


Kansas 


32836 


Michigan 


33116 


Arizona 


30267 


Washington 


35409 


Nevada 


35883 


Georgia 


31121 


Utah 


28061 


Louisiana 


24820 



=PERCENTILE(A2:D26,0.25) 
=PERCENTILE(A2:D26,0.50) 
=PERCENTILE(A2:D26,0.75) 
=PERCENTILE(A2:D26,0.95) 



$30338.5 
$32657 
$36144.75 
$42866.05 



Nearest states 

Arizona and North Carolina 
Ohio and Kansas 
Illinois and Rhode Island 
Maryland and New Jersey 
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SUMMATION NOTATION 

3.47 Write the terms in each of the following indicated sums: 
(a) £(A} + 2) (c) g £}(£} + 6) (e) 

(*) ilfjXj (d) J2(Y 2 k -4) 

3.48 Express each of the following by using the summation notation: 

(a) (X, + 3) 3 + (X 2 + 3) 3 + (Z 3 + 3) 3 

(b) /, (Y\ — a) 2 + f 2 (Y 2 — a) 2 + ■ ■ ■ + f l5 (Y l5 - a) 2 

(c) {2X X - 3F,) + (2X 2 - 3F 2 ) + • • • + (2X jV - 3Fjv) 

(d) (^/F, - l) 2 + (X 2 /Y 2 - l) 2 + • • • + (Z 8 /F 8 - l) 2 

, , /iflf + / 2 a 2 + --- + /i 2 qf 2 
K) A+fi + '-' + fn 

3.49 Prove that {Xj ~ l f = Ej^i X 2 -2 YjJI, Xj + N. 

3.50 Prove that YJ ( x + a)( Y + b) = YJ XY + aJ2 Y + b*£ X + Nab, where a and b are constants. What 
subscript notation is implied? 

3.51 Two variables, U and V, assume the values f7, =3, U 2 = -2, U 3 = 5, and V x = -4, F 2 = -1, K 3 = 6, 
respectively. Calculate (a) £ UV, (b) ]T ([/ + 3)(K - 4), (c) J] K 2 , (rf) (YJ [/)(£ F) 2 , (e) £ f/F 2 , 
(/) E - 2F 2 + 2), and (g) J2(U/V). 

3.52 Given YJ* =1 X } = 7, Y^ =] F ; = -3, and Y.U x jYj = 5, find (a) £; = i(2Jt) + 5F ; ) and (ft) E./=i(^, ~ 3) 
(21} +1). 

THE ARITHMETIC MEAN 

3.53 A student received grades of 85, 76, 93, 82, and 96 in five subjects. Determine the arithmetic mean of the grades. 

3.54 The reaction times of an individual to certain stimuli were measured by a psychologist to be 0.53, 0.46, 0.50, 
0.49, 0.52, 0.53, 0.44, and 0.55 seconds respectively. Determine the mean reaction time of the individual to 
the stimuli. 

3.55 A set of numbers consists of six 6's, seven 7's, eight 8's, nine 9's and ten 10's. What is the arithmetic mean of 
the numbers? 

3.56 A student's grades in the laboratory, lecture, and recitation parts of a physics course were 71, 78, and 89, 
respectively. 

(a) If the weights accorded these grades are 2, 4, and 5, respectively, what is an appropriate average grade? 

(b) What is the average grade if equal weights are used? 

3.57 Three teachers of economics reported mean examination grades of 79, 74, and 82 in their classes, which 
consisted of 32, 25, and 17 students, respectively. Determine the mean grade for all the classes. 
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3.58 The mean annual salary paid to all employees in a company is $36,000. The mean annual salaries paid to 
male and female employees of the company is $34,000 and $40,000 respectively. Determine the percentages 
of males and females employed by the company. 

3.59 Table 3.8 shows the distribution of the maximum loads in short tons (1 short ton = 2000 lb) supported by 
certain cables produced by a company. Determine the mean maximum loading, using (a) the long method 
and (b) the coding method. 



Table 3.8 



Maximum Load 
(short tons) 


Number of 
Cables 


9.3-9.7 




9.8-10.2 


5 


10.3-10.7 


12 


10.8-11.2 


17 


11.3-11.7 


14 


11.8-12.2 


6 


12.3-12.7 


3 


12.8-13.2 


1 




Total 60 



3.60 Find X for the data in Table 3.9, using (a) the long method and (b) the coding method. 



Table 3.9 



X 


462 


480 498 516 534 552 570 


588 606 624 


f 


98 


75 56 42 30 21 15 


116 2 



3.61 Table 3.10 shows the distribution of the diameters of the heads of rivets manufactured by a company. 
Compute the mean diameter. 

3.62 Compute the mean for the data in Table 3.11. 



Diameter (cm) 


Frequency 


0.7247-0.7249 


2 


0.7250-0.7252 


6 


0.7253-0.7255 


8 


0.7256-0.7258 


15 


0.7259-0.7261 


42 


0.7262-0.7264 


68 


0.7265-0.7267 


49 


0.7268-0.7270 


25 


0.7271-0.7273 


18 


0.7274-0.7276 


12 


0.7277-0.7279 


4 


0.7280-0.7282 


1 




Total 250 



Class 


Frequency 


10 to under 15 


3 


15 to under 20 


7 


20 to under 25 


16 


25 to under 30 


12 


30 to under 35 


9 


35 to under 40 


5 


40 to under 45 


2 




Total 54 
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3.63 Compute the mean TV viewing time for the 400 junior high students per week in Problem 2.20. 

3.64 (a) Use the frequency distribution obtained in Problem 2.27 to compute the mean diameter of the ball 

bearings. 

(b) Compute the mean directly from the raw data and compare it with part (a), explaining any discrepancy. 
THE MEDIAN 



3.65 Find the mean and median of these sets of numbers: (a) 5, 4, 8, 3, 7, 2, 9 and (b) 18.3, 20.6, 19.3, 22.4, 20.2, 
18.8, 19.7, 20.0. 

3.66 Find the median grade of Problem 3.53. 

3.67 Find the median reaction time of Problem 3.54. 

3.68 Find the median of the set of numbers in Problem 3.55. 

3.69 Find the median of the maximum loads of the cables in Table 3.8 of Problem 3.59. 

3.70 Find the median X for the distribution in Table 3.9 of Problem 3.60. 

3.71 Find the median diameter of the rivet heads in Table 3.10 of Problem 3.61. 

3.72 Find the median of the distribution in Table 3.1 1 of Problem 3.62. 

3.73 Table 3.12 gives the number of deaths in thousands due to heart disease in 1993. Find the median age for 
individuals dying from heart disease in 1993. 



Table 3.12 



Age Group 


Thousands of Deaths 


Total 


743.3 


Under 1 


0.7 


1 to 4 


0.3 


5 to 14 


0.3 


15 to 24 


1.0 


25 to 34 


3.5 


35 to 44 


13.1 


45 to 54 


32.7 


55 to 64 


72.0 


65 to 74 


158.1 


75 to 84 


234.0 


85 and over 


227.6 



Source: U.S. National Center for Health Statistics, Vital 
Statistics of the U.S., annual. 



3.74 Find the median age for the U.S. using the data for Problem 2.31. 

3.75 Find the median TV viewing time for the 400 junior high students in Problem 2.20. 
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THE MODE 

3.76 Find the mean, median, and mode for each set of numbers: (a) 7, 4, 10, 9, 15, 12, 7, 9, 7 and (b) 8, 11, 4, 3, 2, 
5, 10, 6, 4, 1, 10, 8, 12, 6, 5, 7. 

3.77 Find the modal grade in Problem 3.53. 

3.78 Find the modal reaction time in Problem 3.54. 

3.79 Find the mode of the set of numbers in Problem 3.55. 

3.80 Find the mode of the maximum loads of the cables of Problem 3.59. 

3.81 Find the mode X for the distribution in Table 3.9 of Problem 3.60. 

3.82 Find the modal diameter of the rivet heads in Table 3.10 of Problem 3.61. 

3.83 Find the mode of the distribution in Problem 3.62. 

3.84 Find the modal TV viewing time for the 400 junior high students in Problem 2.20. 

3.85 (a) What is the modal age group in Table 2.15? 
(b) What is the modal age group in Table 3.12? 

3.86 Using formulas (9) and (10) in this chapter, find the mode of the distributions given in the following 
Problems. Compare your answers obtained in using the two formulas. 

(a) Problem 3.59 (b) Problem 3.61 (c) Problem 3.62 (d) Problem 2.20. 

3.87 A continuous random variable has probability associated with it that is described by the following prob- 
ability density function. f(x) = -0.7 5x 2 + l.5x for < x < 2 and f(x) = for all other x values. The mode 
occurs where the function attains a maximum. Use your knowledge of quadratic functions to show that the 
mode occurs when x = 1 . 

THE GEOMETRIC MEAN 

3.88 Find the geometric mean of the numbers (a) 4.2 and 16.8 and (b) 3.00 and 6.00. 

3.89 Find (a) the geometric mean G and (b) the arithmetic mean X of the set 2, 4, 8, 16, 32. 

3.90 Find the geometric mean of the sets (a) 3, 5, 8, 3, 7, 2 and (b) 28.5, 73.6, 47.2, 31.5, 64.8. 

3.91 Find the geometric mean for the distributions in (a) Problem 3.59 and (b) Problem 3.60. Verify that the 
geometric mean is less than or equal to the arithmetic mean for these cases. 

3.92 If the price of a commodity doubles in a period of 4 years, what is the average percentage increase 
per year? 
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3.93 In 1980 and 1996 the population of the United States was 226.5 million and 266.0 million, respectively. 
Using the formula given in Problem 3.38, answer the following: 

(a) What was the average percentage increase per year? 

(b) Estimate the population in 1985. 

(c) If the average percentage increase of population per year from 1996 to 2000 is the same as in part (a), 
what would the population be in 2000? 

3.94 A principal of $1000 is invested at an 8% annual rate of interest. What will the total amount be after 6 years 
if the original principal is not withdrawn? 

3.95 If in Problem 3.94 the interest is compounded quarterly (i.e., there is a 2% increase in the money every 
3 months), what will the total amount be after 6 years? 



3.96 Find two numbers whose arithmetic mean is 9.0 and whose geometric mean is 7.2. 



THE HARMONIC MEAN 



3.97 Find the harmonic mean of the numbers (a) 2, 3, and 6 and (b) 3.2, 5.2, 4.8, 6.1, and 4.2. 

3.98 Find the (a) arithmetic mean, (b) geometric mean, and (c) harmonic mean of the numbers 0, 2, 4, and 6. 

3.99 If X u X 2 , X 3 , . . . represent the class marks in a frequency distribution with corresponding class frequencies 
fu fi, h,---, respectively, prove that the harmonic mean H of the distribution is given by 

H N\X X X 2 X 3 J X 

where N = f x + f 2 + ■ ■ ■ = £ /. 

3.100 Use Problem 3.99 to find the harmonic mean of the distributions in (a) Problem 3.59 and (b) Problem 3.60. 
Compare with Problem 3.91. 

3.101 Cities A, B, and C are equidistant from each other. A motorist travels from A to B at 30mi/h, from B to C at 
40mi/h, and from C to A at 50mi/h. Determine his average speed for the entire trip. 

3.102 (a) An airplane inn els distances of d\, d 2 , and d 3 miles at speeds v 2 , and vj mi/h, respectively. Show that 

the average speed is given by V, where 

d,+d 2 + d 3 d, d 2 d 3 
^ vi + v 2 + v 3 

This is a weighted harmonic mean. 
(b) Find V if d x = 2500, d 2 = 1200, d 3 = 500, w, = 500, v 2 = 400, and v 3 = 250. 



3.103 Prove that the geometric mean of two positive numbers a and b is (a) less than or equal to the arithmetic 
mean and (b) greater than or equal to the harmonic mean of the numbers. Can you extend the proof to more 
than two numbers? 



94 



MEASURES OF CENTRAL TENDENCY 



[CHAP. 3 



THE ROOT MEAN SQUARE, OR QUADRATIC MEAN 



3.104 Find the RMS (or quadratic mean) of the numbers (a) 11, 23, and 35 and (b) 2.7, 3.8, 3.2, and 4.3. 

3.105 Show that the RMS of two positive numbers, a and b, is (a) greater than or equal to the arithmetic mean and 
(b) greater than or equal to the harmonic mean. Can you extend the proof to more than two numbers? 

3.106 Derive a formula that can be used to find the RMS for grouped data and apply it to one of the frequency 
distributions already considered. 



QUARTILES, DECILES, AND PERCENTILES 



3.107 Table 3.13 shows a frequency distribution of grades on a final examination in college algebra, (a) Find the 
quartiles of the distribution, and (b) interpret clearly the significance of each. 



Grade 

90-100 
80-89 
70-79 
60-69 
50-59 
40-49 
30-39 



3.108 Find the quartiles Q u Q 2 , and Q 3 for the distributions in (a) Problem 3.59 and (b) Problem 3.60. Interpret 
clearly the significance of each. 

3.109 Give six different statistical terms for the balance point or central value of a bell-shaped 
frequency curve. 

3.110 Find (a) P l0 , (b) P 90 , (c) P 25 , and (d) P 15 for the data of Problem 3.59, interpreting clearly the significance 
of each. 

3.111 (a) Can all quartiles and deciles be expressed as percentiles? Explain. 
(b) Can all quantiles be expressed as percentiles? Explain. 

3.112 For the data of Problem 3.107, determine (a) the lowest grade scored by the top 25% of the class and (b) the 
highest grade scored by the lowest 20% of the class. Interpret your answers in terms of percentiles. 

3.113 Interpret the results of Problem 3.107 graphically by using (a) a percentage histogram, (b) a percentage 
frequency polygon, and (c) a percentage ogive. 

3.114 Answer Problem 3 . 1 1 3 for the results of Problem 3.108. 



3.115 (a) Develop a formula, similar to that of equation (8) in this chapter, for computing any percentile from a 
frequency distribution. 

(b) Illustrate the use of the formula by applying it to obtain the results of Problem 3.110. 



The Standard Deviation 
I and Other Measures 
H of Dispersion 



DISPERSION, OR VARIATION 

The degree to which numerical data tend to spread about an average value is called the dispersion, 
or variation, of the data. Various measures of this dispersion (or variation) are available, the most 
common being the range, mean deviation, semi-interquartile range, 10-90 percentile range, and standard 
deviation. 



THE RANGE 

The range of a set of numbers is the difference between the largest and smallest numbers in the set. 

EXAMPLE 1 . The range of the set 2, 3, 3, 5, 5, 5, 8, 10, 12 is 12 - 2 = 10. Sometimes the range is given by simply 
quoting the smallest and largest numbers; in the above set, for instance, the range could be indicated as 2 to 12, 
or 2-12. 



THE MEAN DEVIATION 

The mean deviation, or average deviation, of a set of N numbers X X ,X 2 , ■ ■ ■ ,X N is abbreviated MD 
and is defined by 

N 

y \Xj - x\ 

f-f VIZ-ZI 

Mean deviation (MD) = = 1 = \X-X\ (1) 

where X is the arithmetic mean of the numbers and \Xj — X\ is the absolute value of the deviation of Xj 
from X. (The absolute value of a number is the number without the associated sign and is indicated by 
two vertical lines placed around the number; thus | - 4| = 4, | + 3| = 3, |6| = 6, and | - 0.84| = 0.84.) 

95 
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EXAMPLE 2. Find the mean deviation of the set 2, 3 

Arithmetic mean (X) 

MO = 12 ~ 61 + 13 ~ 61 + 16 ~ 61 + 18 ~ 61 + 11 1 ~ 61 : 



If X U X 2 ,...,X K occur with frequencies f\,f 2 , ■ ■ ■ ,/k, respectively, the mean deviation can be 
written as 

Y,m-x\ _ 

^ J ^- X ^ \X^Y\ (2) 

where N = J2f=i fj = Hf- This form is useful for grouped data, where the X/s represent class marks 
and the f/s are the corresponding class frequencies. 

Occasionally the mean deviation is defined in terms of absolute deviations from the median or other 
average instead of from the mean. An interesting property of the sum J2jLi \ x j ~ a \ i s th at h is a 
minimum when a is the median (i.e., the mean deviation about the median is a minimum). 

Note that it would be more appropriate to use the terminology mean absolute deviation than mean 
deviation. 



6, 8, 11. 

2 + 3 + 6 + 8+11 
= 5 = 6 

_ | - 4| + | - 3| + |0| + |2| + |5| _ 4 + 3 + + 2 + 5 



THE SEMI-INTERQUARTILE RANGE 

The semi-interquartile range, or quart He deviation, of a set of data is denoted by Q and is defined by 

e = ^ « 

where Q { and Q 3 are the first and third quartiles for the data (see Problems 4.6 and 4.7). The interquartile 
range Q 3 - Q x is sometimes used, but the semi-interquartile range is more common as a measure of 
dispersion. 



THE 10-90 PERCENTILE RANGE 

The 70-90 percentile range of a set of data is defined by 

10-90 percentile range = P w - P w (4) 

where P w and P 90 are the 10th and 90th percentiles for the data (see Problem 4.8). The semi- 10-90 
percentile range, ^(Pgo - Pw), can also be used but is not commonly employed. 



THE STANDARD DEVIATION 

The standard deviation of a set of N numbers X l ,X 2 ,...,X N is denoted by s and is defined by 

YiXi-X) 2 i , 

where x represents the deviations of each of the numbers Xj from the mean X. Thus s is the root mean 
square (RMS) of the deviations from the mean, or, as it is sometimes called, the root-mean-square 
deviation. 
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If X U X 2 , ■■■,X K occur with frequencies f l ,f 2 , • ••,/*:, respectively, the standard deviation can be 
written 

y ux, - xf , , 

where N — J2f=i fj — J2 f- in this form it is useful for grouped data. 

Sometimes the standard deviation of a sample's data is defined with (N - 1) replacing N in the 
denominators of the expressions in equations (5) and (6) because the resulting value represents a better 
estimate of the standard deviation of a population from which the sample is taken. For large values of N 
(certainly N > 30), there is practically no difference betwen the two definitions. Also, when the better 
estimate is needed we ca n always ob tain it by multiplying the standard deviation computed according to 
the first definition by y/N/(N - 1). Hence we shall adhere to the form (5) and (6). 



THE VARIANCE 

The variance of a set of data is defined as the square of the standard deviation and is thus given by s 
in equations (5) and (6). 

When it is necessary to distinguish the standard deviation of a population from the standard 
deviation of a sample drawn from this population, we often use the symbol s for the latter and a 
(lowercase Greek sigma) for the former. Thus s 2 and o 2 would represent the sample variance and 
population variance, respectively. 



SHORT METHODS FOR COMPUTING THE STANDARD DEVIATION 

Equations (5) and (6) can be written, respectively, in the equivalent forms 



/EfjXj E^ 



N 



\ N ) 



(8) 



where X 2 denotes the mean of the squares of the various values of X, while X 2 denotes the square of the 
mean of the various values of X (see Problems 4.12 to 4.14). 

If dj — Xj - A are the deviations of 2} from some arbitrary constant A, results (7) and (8) become, 
respectively, 



/E4 2 (E4 

N 



V N J 

~7 



Ed 2 (Ed\ 



E M 



\ N J 



(10) 



(See Problems 4.15 and 4.17.) 
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When data are grouped into a frequency distribution whose class intervals have equal size c, we have 
dj = cuj or Xj = A + cuj and result (70) becomes 



s = c 




This last formula provides a very short method for computing the standard deviation and should always 
be used for grouped data when the class-interval sizes are equal. It is called the coding method and 
is exactly analogous to that used in Chapter 3 for computing the arithmetic mean of grouped data. 
(See Problems 4.16 to 4.19.) 



PROPERTIES OF THE STANDARD DEVIATION 

1 . The standard deviation can be defined as 



" _ V TV 

where a is an average besides the arithmetic mean. Of all such standard deviations, the minimum 
is that for which a = X, because of Property 2 in Chapter 3. This property provides an 
important reason for defining the standard deviation as above. For a proof of this property, 
see Problem 4.27. 

For normal distributions (see Chapter 7), it turns out that (as shown in Fig. 4-1): 

(a) 68.27% of the cases are included between X-s and X + s (i.e., one standard deviation on 
either side of the mean). 

(b) 95.45% of the cases are included between X — 2s and X + 2s (i.e., two standard deviations 
on either side of the mean). 

(c) 99.73% of the cases are included between X — 3s and X + 3s (i.e., three standard deviations on 
either side of the mean). 

For moderately skewed distributions, the above percentages may hold approximately 
(see Problem 4.24). 

Suppose that two sets consisting of N { and N 2 numbers (or two frequency distributions with 
total frequencies N { and 7V 2 ) have variances given by .v, and si, respectively, and have the same 
mean X. Then the combined, or pooled, variance of both sets (or both frequency distributions) 
is given by 

Note that this is a weighted arithmetic mean of the variances. This result can be generalized to three 
or more sets. 

Chebyshev's theorem states that for k> 1, there is at least (1 - (l//c 2 )) x 100% of the probability 
distribution for any variable within k standard deviations of the mean. In particular, when k=2, 
there is at least (1 -(1/2 2 )) x 100% or 75% of the data in the interval (x-2S, x + 2S), when 
k= 3 there is at least (1 - (1/3 2 )) x 100% or 89% of the data in the interval (x -3S,x + 3S), 
and when k = 4 there is at least (1 -(1/4 2 )) x 100% or 93.75% of the data in the interval 
{x-AS, x + 4S). 
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Fig. 4-1 Illustration of the empirical rule. 

CHARLIER'S CHECK 

Charlier's check in computations of the mean and standard deviation by the coding method makes 
use of the identities 

£ /(« + i) = £/" + £/ = £ fu + n 

£ /(« + l) 2 = £ f(u 2 + 2u + 1) = £ fu 2 + 2 £ fu + £ / = £ ,/« 2 +2£ fu + N 
(See Problem 4.20.) 
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SHEPPARD'S CORRECTION FOR VARIANCE 

The computation of the standard deviation is somewhat in error as a result of grouping the data into 
classes (grouping error). To adjust for grouping error, we use the formula 

Corrected variance = variance from grouped data - ^ (13) 

where c is the class-interval size. The correction c /12 (which is subtracted) is called Sheppard's correc- 
tion. It is used for distributions of continuous variables where the "tails" go gradually to zero in both 
directions. 

Statisticians differ as to when and whether Sheppard's correction should be applied. It should 
certainly not be applied before one examines the situation thoroughly, for it often tends to overcorrect, 
thus replacing an old error with a new one. In this book, unless otherwise indicated, we shall not be using 
Sheppard's correction. 



EMPIRICAL RELATIONS BETWEEN MEASURES OF DISPERSION 

For moderately skewed distributions, we have the empirical formulas 

Mean deviation = | (standard deviation) 

Semi-interquartile range = § (standard deviation) 

These are consequences of the fact that for the normal distribution we find that the mean 
deviation and semi-interquartile range are equal, respectively, to 0.7979 and 0.6745 times the standard 
deviation. 



ABSOLUTE AND RELATIVE DISPERSION; COEFFICIENT OF VARIATION 

The actual variation, or dispersion, as determined from the standard deviation or other r 
of dispersion is called the absolute dispersion. However, a variation (or dispersion) of 10 inches 
(in) in measuring a distance of 1000 feet (ft) is quite different in effect from the same variation 
of 10 in in a distance of 20 ft. A measure of this effect is supplied by the relative dispersion, which 
is defined by 

„ , . , . . absolute dispersion 

Relative dispersion = (14) 

average 

If the absolute dispersion is the standard deviation s and if the average is the mean X, then the 
relative dispersion is called the coefficient of variation, or coefficient of dispersion; it is denoted by V and 
is given by 

(75) 

and is generally expressed as a percentage. Other possibilities also occur (see Problem 4.30). 

Note that the coefficient of variation is independent of the units used. For this reason, it is useful in 
comparing distributions where the units may be different. A disadvantage of the coefficient of variation 
is that it fails to be useful when X is close to zero. 
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STANDARDIZED VARIABLE; STANDARD SCORES 

The variable that measures the deviation from the mean in units of the standard deviation is called a 
standardized variable, is a dimensionless quantity (i.e., is independent of the units used), and is given by 

If the deviations from the mean are given in units of the standard deviation, they are said to be 
expressed in standard units, or standard scores. These are of great value in the comparison of distributions 
(see Problem 4.31). 

SOFTWARE AND MEASURES OF DISPERSION 

The statistical software gives a variety of measures for dispersion. The dispersion measures are 
usually given in descriptive statistics. EXCEL allows for the computation of all the measures discussed 
in this book. MINITAB and EXCEL are discussed here and outputs of other packages are given in the 
solved problems. 

EXAMPLE 3. 

(a) EXCEL provides calculations for several measures of dispersion. The following example illus- 
trates several. A survey was taken at a large company and the question asked was how many 
e-mails do you send per week? The results for 75 employees are shown in ALE15 of an EXCEL 
worksheet. 

32 113 70 60 84 
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The range is given by =MAX(A1:E15)-MIN(A1:E15) or 125-24= 101. The mean deviation 
or average deviation is given by =AVEDEV(A1:E15) or 24.42. The semi-inlerquartile range 
is given by the expression =(PERCENTILE(Al:E15,0.75)-PERCENTILE(Al:E15,0.25))/2 
or 22. The 10-90 percentile range is given by PERCENTILE(A1:E15,0.9)-PERCENTILE 
(A1:E15,0.1) or 82.6. 

The standard deviation and variance is given by =STDEV(A1:E15) or 29.2563, and 
=VAR(A1:E15) or 855.932 for samples and =STDEVP(A1:E15) or 29.0606 and 
=VARP(A1:E15) or 844.52 for populations. 



THE STANDARD DEVIATION AND OTHER MEASURES OF DISPERSION [CHAP. 4 




Fig. 4-2 Dialog box for MINITAB. 



The dialog box in Fig. 4-2 shows MINITAB choices for measures of dispersion and measures 
of central tendency. Output is as follows: 



Descriptive Statistics: e-mails 



Variable StDev Variance CoefVar Minimum Ql Q3 Maximum Range IQR 

e-mails 29.26 855.93 44.56 24.00 40.00 86.00 125.00 101.00 46.00 



THE RANGE 



Solved Problems 



Find the range of the sets (a) 12, 6, 7, 3, 15, 10, 18, 5 and (b) 9, 3, 8, 8, 9, 8, 9, 18. 



In both cases, range = 
arrays of sets (a) and (b), 



largest number - smallest number = 



(a) 3, 5, 6, 7, 10, 12, 15, 18 



(b) 3, 8, 8, 8, 9, 9, 9, 18 



there is much 

Since the range indi 



dispersion, in (a) than in (b). In fact, (b) consists mainly of 8's and 9's. 
difference between the sets, it is not a very good measure of dispersion in 
this case. Where extreme values are present, the range is generally a poor measure of dispersion. 

An improvement is achieved by throwing out the extreme cases, 3 and 18. Then for set (a) the range is 
(15 — 5) = 10, while for set (b) the range is (9 — 8) = 1, clearh showing thai (a) has greater dispersion than 
(b). However, this is not the way the range is defined. The semi-interquartile range and the 10-90 percentile 
range were designed to improve on the range by eliminating extreme cases. 



4.2 Find the range of heights of the students at XYZ University as given in Table 2.1. 
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SOLUTION 

There are two ways of defining the range for grouped data. 
First method 

Range = class mark of highest class - class mark of lowest class 
= 73-61 = 12 in 

Second method 

Range = upper class boundary of highest class - lower class boundary of lowest class 
= 74.5 - 59.5 = 15 in 
The first method tends to eliminate extreme cases to some extent. 

THE MEAN DEVIATION 

4.3 Find the mean deviation of the sets of numbers in Problem 4.1. 
SOLUTION 

(a) The arithmetic mean is 

12 + 6 + 7 + 3 + 15 + 10+ 18 + 5 _ 76 

X ~ 8 ~T 

The mean deviation is 
N 

1 12 — 9.5| + |6 - 9.5| + |7 - 9.5| + |3 - 9.5| + |15 - 9.5| + 1 10 - 9.5| + |18 - 9.5| + |5 - 9.5| 

8 

2.5 + 3.5 + 2.5 + 6.5 + 5.5 + 0.5 + 8.5 + 4.5 34 , „, 

= 8 = T = 425 

(b) ^ = 9 + 3 + 8 + 8 + 9 + 8 + 9 + 18 = 72 = 9 

8 " 8 

N 

_ [9 - 9J + |3 - 9| + |8 - 9| + |8 - 9| + |9 - 9| + |8 - 9| + [9 - 9J + |18 - 9| 
8 

_ + 6+1 + 1+0+1+0 + 9 _ 225 
The mean deviation indicates that set (b) shows less dispersion than set (a), as it should. 

4.4 Find the mean deviation of the heights of the 100 male students at XYZ University (see Table 3.2 
of Problem 3.20). 



From Problem 3.20, X = 67.45 in. The work can be arranged as in Table 4.1. It is also possible to devise 
a coding method for computing the mean deviation (see Problem 4.47). 
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Height (in) 


Class Mark (X) 


\X-X\ = pT- 67.45| 


Frequency (/) 


f\x-x\ 


60-62 


61 


6.45 


5 


32.25 


63-65 


64 


3.45 


18 


62.10 


66-68 
69-71 


67 


0.45 


42 
27 


18.90 


72-74 


73 




8 


44.40 




N = £ / = 100 


£ f\X-X\ = 226.50 



_ E f\X-X\ _ 226.50 _ 



4.5 Determine the percentage of the students' heights in Problem 4.4 that fall within the ranges 
(a) X ± MD, (b)X±2 MD, and (c) X ± 3 MD. 



SOLUTION 

(a) The range from 65.19 to 69.71 in is X ± MD = 67.45 ± 2.26. This range includes all individuals in the 
third class + ±(65.5 - 65.19) of the students in the second class + ±(69.71 - 68.5) of the students in the 
fourth class (since the class-interval size is 3 in, the upper class boundary of the second class is 65.5 in, 
and the lower class boundary of the fourth class is 68.5 in). The number of students in the range 
I±MD is 



42 



0.31 



-(18H 



- (27) = 42+ 1.86+ 10.89 = 54.75 



55 



which is 55% of the total. 
(b) The range from 62.93 to 71.97 in 
students in the range X ± 2 MD is 

10 / 62.93 - 62.5 V , 



5 X ± 2 MD = 67.45 ±2(2.26) = 67.45 ±4.52. The number of 
,o , o, , ^1.97-71.5\ 



3 



(8) = 85.67 



(c) The range from 60.67 to 74.23 in is X ±3 MD = 67.45 ± 3(2.26) = 67.45 ± 6.78. The number of 
students in the range X ± 3 MD is 



which is 97% of the total. 



THE SEMI-INTERQUARTILE RANGE 



4.6 Find the semi-interquartile range for the height distribution of the students at XYZ University 
(see Table 4.1 of Problem 4.4). 



The lower and upper quartiles are g, = 65.5 + ^(3) = 65.64 in and Q 3 = 68.5 + ±§(3) = 69.61 in, respec- 
tively, and the semi-interquartile range (or quartile deviation) is Q = j(Q 3 - Q x ) = ^ (69.61 - 65.64) = 1.98 in. 
Note that 50% of the cases lie between Q x and Q 3 (i.e., 50 students have heights between 65.64 and 69.61 in). 
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We can consider \ {Q\ + g 3 ) = 67.63 in to be a measure of central tendency (i.e., average height). 
It follows that 50% of the heights lie in the range 67.63 ± 1.98 in. 



4.7 Find the semi-interquartile range for the wages of the 65 employees at the P&R Company (se 
Table 2.5 of Problem 2.3). 



From Problem 3.44, Q x = $268.25 and Q 3 = $290.75. Thus the semi-interquartile range 
2 = ^(23 -Si) =^($290.75 -$268.25) = $11.25. Since 2(2i+23) = $279 - 50 ' we can conclude that 
50% of the employees earn wages lying in the range $279.50 ± $1 1.25. 

THE 10-90 PERCENTILE RANGE 

4.8 Find the 10-90 percentile range of the heights of the students at XYZ University (see Table 2.1). 
SOLUTION 

Here P m = 62.5 + ^(3) = 63.33 in, and P 90 = 68.5 + §f (3) = 71.27 in. Thus the 10-90 percentile range 
is P 90 -P 10 = 71.27- 63.33 = 7.94 in. Since \ (P 10 + P 90 ) = 67.30 in and i(P 90 - P 10 ) = 3.97in, we can 
conclude that 80% of the students have heights in the range 67.30 ± 3.97 in. 

THE STANDARD DEVIATION 

4.9 Find the standard deviation s of each set of numbers in Problem 4.1. 
SOLUTION 

(fl) ^ _ EI _ 12 + 6 + 7 + 3 + 15+10+18 + 5 _ 76 _ g $ 



UX-Xf 
V N 



/ (12-9.5) 2 + (6-9.5) 2 + (7-9.5) 2 + (3-9.5) 2 + (15-9.5) 2 + (10-9.5) 2 + (18-9.5) 2 + (5-9.5) 2 
V 8 
= V23775 = 4.87 

X = - 



9+3+8+8+9+ 



Ux-xf 

V N 



(9 - 9) 2 + (3 - 9) 2 + (8 - 9) 2 + (8 - 9) 2 + (9 - 9) 2 + (8 - 9) 2 + (9 - 9) 2 + (18 - 9) 2 
"V 8 
= v / T5 = 3.87 

The above results should be compared with those of Problem 4.3. It will be noted that the standard 
deviation does indicate that set (b) shows less dispersion than set (a). However, the effect is masked by the 
fact that extreme values affect the standard deviation much more than they affect the mean deviation. This is 
to be expected, of course, since the deviations are squared in computing the standard deviation. 
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4.10 The standard deviations of the two data sets given in Problem 4.1 were found using MINITAB 
and the results are shown below. Compare the answers with those obtained in Problem 4.9. 

MTB > print cl 
setl 

12 6 7 3 15 
MTB > print c2 
set2 

9 3 8 8 9 
MTB > standard deviation cl 

Column Standard Deviation 

Standard deviation of setl = 5.21 
MTB > standard deviation c2 

Column Standard Deviation 

Standard deviation of set2 = 4 . 14 



10 18 5 
8 9 18 



SOLUTION 

The MINITAB package uses the formula 

' V N-\ 

and therefore the standard deviations are not the same in Problems 4.9 and 4.10. The answers in Problem 
4.10 are obtainable from those in Problem 4.9 if we multiply those in Problem 4.9 by y/N/{N - 1). Since 
TV = 8 for both sets ^JN/(N- 1) = 1.069045, and for set 1, we have (1.069045) (4.87) = 5.21, the standard 
deviation given by MINITAB. Similarly, (1.069045) (3. 87) = 4.14, the standard deviation given for set 2 
by MINITAB. 

4.11 Find the standard deviation of the heights of the 100 male students at XYZ University 
(see Table 2.1). 

SOLUTION 

From Problem 3.15, 3.20, or 3.22, X = 67.45 in. The work can be a 
/ E f{X - X) 2 _ / 852.7500 _ 



Table 4.2 



Height (in) 


Class Mark (X) 


X - X = X - 67.45 


(X-Xf 


Frequency (/) 


f(X~Xf 


60-62 


61 


-6.45 


41.6025 


5 


208.0125 


63-65 


64 


-3.45 


11.9025 


18 


214.2450 


66-68 


67 


-0.45 


0.2025 


42 


8.5050 


69-71 


70 


2.55 


6.5025 


27 


175.5675 


72-74 


73 


5.55 


30.8025 


8 


246.4200 




TV = e / = ioo 


J2f(x-x) 2 

= 852.7500 



CHAP. 4] THE STANDARD DEVIATION AND OTHER MEASURES OF DISPERSION 



107 



COMPUTING THE STANDARD DEVIATIONS FROM GROUPED DATA 
4.12 (a) Prove that 



(b) Use the formula in part (a) to find the standard deviation of the set 12, 6, 7, 3, 15, 10, : 
SOLUTION 

(a) By definition, 



s= .YAx-xf 

" V N 

_ J2(x - xf _ J2(x 2 - lxx + X 2 ) _ E x 2 - 2X E x + N X 2 

~ N ~ N ~ N 

_ Y,x 2 y Y,x , yi _ Y,x 2 _ ^2 , -r>2 Y,x 2 ^ 2 

_j:x 2 (ex\ 2 



-*-*■-¥-(¥) 



Note that in the above summations we have used the abbreviated form, with X replacing Xj and 
with E replacing E ; =i- 

Another method 



s 2 = (X- X) 2 = X 2 - 2XX + X 2 = X 2 ~ 2XX + X 2 = X 2 - 2XX + X 2 = X 2 - X 2 
, w ^2 _ J2X 2 _ (12) 2 + (6) 2 + (7) 2 + (3) 2 + (15) 2 + (10) 2 + (18) 2 + (5) 2 _ 912 _ 

y_E^_ 12 + 6 + 7 + 3 + 15+10+18 + 5 _76_ 



Thus s=Vl 2 -I 2 = Vl 14 - 90.25 = V23.75 = 4.87 

This method should be compared with that of Problem 4.9(a). 

4.13 Modify the formula of Problem 4.12(a) to allow for frequencies corresponding to the various 
values of X. 

SOLUTION 

The appropriate modification is 



As in Problem 4.12(a), this can be established by starting with 

„_, fe f{x-xf 
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_ E /(* zlL = L f&l - 2 xx + x 2 ) ^ J2fx 2 -2xj: fx + x 2 j:f 

N N N 

_ E/y 2 _ 7 y E/>' + |2 = E fx 2 _ ir _ + p = E fx 2 _ y 2 



_ E/r 2 (TJx\ 

N \ N ) 



/ E fx 2 (T.f*\ 



Note that in the above summations we have used the abbreviated form, with X and / replacing Z ; 
and fj, E replacing E ; =i, and E ; =i ^ = ^V- 



4.14 Using the formula of Problem 4.13, find the standard deviation for the data in Table 4.2 of 
Problem 4.11. 

SOLUTION 

The work can be arranged as in Table 4.3, where X = (E fX)/N = 67.45 in, as obtained in Problem 
3.15. Note that this method, like that of Problem 4.11, entails much tedious computation. Problem 4.17 
shows how the coding method simplifies the calculations immensely. 



Table 4.3 



Height (in) 


Class Mark (X) 


X 2 


Frequency (/) 


fx 2 


60-62 


61 


3721 


5 


18,605 


63-65 


64 


4096 


18 


73,728 


66-68 


67 


4489 


42 


188,538 


69-71 


70 


4900 


27 


132,300 


72-74 


73 


5329 


8 


42,632 




N = E / = ioo 


E fX 2 = 455,803 



4.15 If d — X - A are the deviations of X from an arbitrary constant A, prove that 
N \ N J 

SOLUTION 

Since d = X - A, X = A + d, and X = A + d (see Problem 3. 18), then 

X-X={A + d)-(A + d) = d-d 

using the result of Problem 4.13 and replacing X and X with d and d, respectively. 
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Another method 

s 2 = {X ~ X) 2 = (d - df = d 
= di_2d 2 + d 2 = d I -d 2 
and the result follows on taking the positive square root. 

4.16 Show that if each class mark X in a frequency distribution having class intervals of equal size c is 
coded into a corresponding value u according to the relation X = A + cu, where A is a given class 
mark, then the standard deviation can be written as 




SOLUTION 

This follows at once from Problem 4. 1 5 since d = X — A = cu. Thus, since c is a constant, 
Another method 

We can also prove the result directly without using Problem 4.15. Since X = A + cu, X = A + cu, and 

X -X = c{u-u), then 

, 2 = (X-Xf = c\u-uf = c 2 (u 2 -2uu + u 2 ) = c 2 {V 2 - 2u 2 + u 2 ) = c 2 {V 2 - u 2 ) 

4.17 Find the standard deviation of the heights of the students at XYZ University (see Table 2.1) 
by using (a) the formula derived in Problem 4.15 and (b) the coding method of Problem 4.16. 

SOLUTION 

In Tables 4.4 and 4.5, A is arbitrarily chosen as being equal to the class mark 67. Note that in Table 4.4 
the deviations d = X — A are all multiples of the class-interval size c = 3. This factor is removed in Table 4.5. 
As a result, the computations in Table 4.5 are greaih simplified (compare them with those of Problems 
4.11 and 4.14). For this reason, the coding method should be used wherever possible. 

(a) See Table 4.4. 



Table 4.4 



Class Mark (X) 


d = X ~A 


Frequency (/) 


fd 


fd 2 


61 


-6 


5 


-30 


180 


64 


-3 


18 


-54 


162 


A 67 





42 








70 


3 


27 


81 


243 


73 


6 


8 


48 


288 




N = £ / = 100 


E fd = 45 


E fd 2 = 873 



2 -2dd + d 2 

JLf* (J2fd\ 2 

N \ N J 
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7 _/E/^ 2 (J2fd\ 2 _ /873 _ /45_\ 3 

' V n \ n J V 100 v 100 y 



(b) See Table 4.5. 



= V0275 = 2.92ir 



Class Mark (X) 


u X ~ A 


Frequency (/) 


/« 


A 2 


61 


-2 


5 


-10 


20 


64 


-1 


18 




18 


^^67 





42 








70 




27 


27 


27 


73 


2 


8 


16 


32 




N = £ / = 100 


E/«=15 


E /" 2 = 97 



= 3V0.9475 - 2.92ir 



4.18 Using coding methods, find (a) the mean and (b) the standard deviation for the wage distribution 
of the 65 employees at the P&R Company (see Table 2.5 of Problem 2.3). 

SOLUTION 

The work can be arranged simply, as shown in Table 4.6. 



(a) 



X = A + cu = A + c = $275.00 + ($10.00) ^ j = $279.77 



z 




/ 


/« 


fu 2 


$255.00 


-2 


8 


-16 


32 


265.00 


-1 


10 


-10 


10 


■+ 275.00 





16 








285.00 


1 


14 


14 


14 


295.00 


2 


10 


20 


40 


305.00 


3 


5 


15 


45 


315.00 


4 


2 


8 


32 




N = E / = 65 


E/" = 31 


E = 173 



4.19 Table 4.7 shows the IQ's of 480 school children at a certain elementary school. Using the coding 
method, find (a) the mean and (b) the standard deviation. 
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Table 4.7 



Class mark (X) 


70 74 78 82 86 90 94 98 102 106 110 114 118 122 126 


Frequency (/) 


4 9 16 28 45 66 85 72 54 38 27 18 11 5 2 



SOLUTION 

The intelligence quotient is 

^ _ mental age 

chronological age 

expressed as a percentage. For example, an 8-year-old child who (according to certain educational proce- 
dures) has a mentality equivalent to that of a 10-year-old child would have an IQ of 10/8 = 1.25 = 125%, or 
simply 125, the % sign being understood. 

To find the mean and standard deviation of the IQ's in Table 4.7, we can arrange the work as in 
Table 4.8. 

CHARLIER'S CHECK 

4.20 Use Charlier's check to help verify the computations of (a) the mean and (b) the standard 
deviation performed in Problem 4.19. 
SOLUTION 

To supply the required check, the columns of Table 4.9 are added to those of Table 4.8 (with the 
exception of column 2, which is repeated in Table 4.9 for convenience). 

(a) From Table 4.9, £ /(« + 1) = 716; from Table 4.8, £ fu + N = 236 + 480 = 716. This provides the 
required check on the mean. 



Table 4.8 



X 




/ 


fu 


fu 2 


70 


-6 


4 


-24 


144 


74 


-5 


9 


-45 


225 


78 


-4 


16 


-64 


256 


82 


-3 


28 


-84 


252 


86 


-2 


45 


-90 


180 


90 


-1 


66 


-66 


66 


94 





85 








98 


1 


72 


72 


72 


102 


2 


54 


108 


216 


106 


3 


38 


114 


342 


no 


4 


27 


108 


432 


114 


5 


18 


90 


450 


118 


6 


11 


66 


396 


122 


7 


5 


35 


245 


126 


8 




16 


128 




N = Y, f = 480 


£ fu = 236 


E fu 2 = 3404 
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Table 4.9 



u + 1 


/ 


/("+!) 


/("+ I) 2 


-5 


4 


-20 


100 


-4 


9 


-36 


144 


-3 


16 


-48 


144 


-2 


28 


-56 


112 


-1 


45 


-45 


45 





66 








1 


85 


85 


85 


2 


72 


144 


288 


3 


54 


162 


486 


4 


38 


152 


608 


5 


27 


135 


675 


6 


18 


108 


648 


7 


11 


77 


539 


8 


5 


40 


320 


9 


2 


18 


162 




7Y = £ / = 480 


£ /(«+!) = 716 


£/>+l) 2 = 4356 



(b) From Table 4.9, £ /(« + l) 2 = 4356; from Table 4.8, £ /w 2 + 2 + N = 3404 + 2(236)+ 
480 = 4356. This provides the required check on the standard deviation. 



SHEPPARD'S CORRECTION FOR VARIANCE 

4.21 Apply Sheppard's correction to determine the standard deviation of the data in (a) Problem 4.17, 
(b) Problem 4.18, and (c) Problem 4.19. 

SOLUTION 

3 2 /12 = 7.7775. Corrected 



10 2 /12 = 23 5.08. Corrected 
4 2 /12 = 108.27. Corrected 



(a) s 2 = 8.5275, and c = 3. Corrected variance = .s 2 - c 2 / 12 = 8.5275 - 
standard deviation = %/correct variance = a/7.7775 = 2.79 in. 

(b) .s- 2 = 243.41, and c = 10. Corrected variance = .s 2 - c 2 / 12 = 243.41 - 
standard deviation = V235.08 = $15.33. 

(c) s 2 = 109.60, and c = 4. Corrected variance = i - c 2 /12 = 109.60 - 
standard deviation = V108.27 = 10.41. 



4.22 For the second frequency distribution of Problem 2.8, find (a) the mean, (b) the standard devia- 
tion, (c) the standard deviation using Sheppard's correction, and (d) the actual standard deviation 
from the ungrouped data. 

SOLUTION 

The work is arranged in Table 4.10. 

(a) x = A + cu = A + c^J^= 14 9 + 9(^) = 147.01b 

(c) Corrected variance = .s 2 - c 2 /12 = 188.27 - 9 2 /12 = 181.52. Corrected standard deviation = 13.5 lb. 



CHAP. 4] THE STANDARD DEVIATION AND OTHER MEASURES OF DISPERSION 



x 




/ 


/" 


fu 2 


122 


-3 


3 


-9 


27 


131 




5 


-10 


20 


140 


-1 


9 


-9 


9 


149 





12 








158 


1 


5 


5 


5 


167 


2 


4 




16 


176 


3 


2 


6 


18 




JV = E / = 40 


E /« = -9 


E /" 2 = 95 



(d) To compute the standard deviation from the actual weights of the students given in the problem, 
it is convenient first to subtract a suitable number, say A = 1 50 lb, from each weight and then 
use the method of Problem 4.15. The deviations d = X — A = X— 150 are then given in the 
following table: 



14 





-10 
-12 



-18 



-25 



-1 



-5 



-15 



from which we find that E d = - 128 and E d 2 = 7052. Then 



Hence Sheppard's correction supplied some improvement i 



— = Vl66.06 = 12.91b 
40 J 

this case. 



EMPIRICAL RELATIONS BETWEEN MEASURES OF DISPERSION 

4.23 For the distribution of the heights of the students at XYZ University, discuss the validity of the 
empirical formulas (a) mean deviation = | (standard deviation) and (b) semi-interquartile 
range = | (standard deviation). 



(a) From Problems 4.4 and 4.11, mean deviation standard deviation - 2.26/2.92 = 0.77, which is 
close to |. 

(b) From Problems 4.6 and 4.1 1, semi-interquartile range standard deviation = 1.98/2.92 = 0.68, which 
is close to |. 

Thus the empirical formulas are valid in this case. 

Note that in the above we have not used the standard deviation with Sheppard's correction for group- 
ing, since no corresponding correction has been made for the mean deviation or semi-interquartile range. 



PROPERTIES OF THE STANDARD DEVIATION 



4.24 Determine the percentage of the students' IQ's in Problem 4.19 that fall within the ranges 
(a) X ± .v, (b) X ± 2s, and (c) X ± 3s. 
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SOLUTION 

(a) The range of IQ's from 85.5 to 106.4 is X ± s = 95.97 ± 10.47. The number of IQ's in the range X ± s is 

( ^ ) (45) + 66 + 85 + 72 + 54 + ( "6.4-104 ) (3g) = ^ 

The percentage of IQ's in the range X ± s is 339/480 = 70.6%. 

(b) The range of IQ's from 75.0 to 1 16.9 is X ± 2s = 95.97 ± 2(10.47). The number of IQ's in the range 
X ± 2s is 

( 76- 75.0 ) (9) + 16 + 28 + 45 + 66 + 85 + 72 + 54 + 38 + 27 + 18 + ( 116 - 9 ~ 116 )(n) = 451 

The percentage of IQ's in the range X ± 2s is 451/480 = 94.0%. 

(c) The range of IQ's from 64.6 to 127.4 is X ± 3* = 95.97 ± 3(10.47). The number of IQ's in the range 
X ± 3s is 

480- ( 128 - 127 - 4 ) (2) = 479.7 or 480 

The percentage of IQ's in the range X ± 3s is 479.7/480 = 99.9%, or practically 100%. 

The percentages in parts (a), (b), and (c) agree favorably with those to be expected for a normal 
distribution: 68.27%, 95.45%, and 99.73%, respectively. 

Note that we have not used Sheppard's correction for the standard deviation. If this is used, the results 
in this case agree closely with the above. Note also that the above results can also be obtained by using 
Table 4.11 of Problem 4.32. 

4.25 Given the sets 2, 5, 8, 1 1, 14, and 2, 8, 14, find (a) the mean of each set, (b) the variance of each set, 
(c) the mean of the combined (or pooled) sets, and (d) the variance of the combined sets. 



SOLUTION 

(a) Mean of first set = A (2 + 5 + 8 + 1 1 + 14) = 8. Mean of second set = i (2 + 8 + 14) = 8. 

(b) Variance of first set = s\ = \ [(2 - 8) 2 + (5 - 8) 2 + (8 - 8) 2 + (1 1 - 8) 2 + (14 - 8) 2 ] = 18. Variance of 
second set = s\ = i [(2 - 8) 2 + (8 - 8) 2 + ( 14 - 8) 2 ] = 24. 

(c) The mean of the combined sets is 

2 + 5 + 8+11 + 14 + 2 + 8+14 

5T3 = 8 

(d) The variance of the combined sets is 

r = (2 - 8) 2 + (5 - 8) 2 + (8 - 8) 2 + (1 1 - 8) 2 + (14 - 8) 2 + (2 - 8) 2 + (8 - 8) 2 + (14 - 8) 2 = 
5 + 3 

Another method (by formula) 



2 __ Nifi + N 2 s 2 2 = (5)(18) + (3)(24) _ 
TV, + 7V 2 5 + 3 
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4.26 Work Problem 4.25 for the sets 2, 5, 8, 11, 14 and 10, 16, 22. 



Here the means of the two sets are 8 and 16, respectively, while ihe variances are the s< 
the preceding problem, namely, sj = 18 and si = 24. 



2 _ (2- ll) 2 + (5- ll) 2 + (8- ll) 2 + (11 - ll) 2 + (14- U) 2 + (10- ll) 2 + (16- ll) 2 + (22- ll) 2 
5 + 3 

= 35.25 
Note that the formula 

2 _ JV 2 + N 2 S2 

S N t +N 2 

which gives the value 20.25, is not applicable in this case since the means of the two sets are not the same. 

4.27 (a) Prove that w 2 + pw + q, where p and q are given constants, is a minimum if and only if 
w = -\p. 
(b) Using part (a), prove that 

N 

is a minimum if and only if a = X. 



(a) We have w 2 +pw + q = (w + jp) 2 + q — \p 2 ■ Since (q - \p 2 ) is a constant, the expression has the least 
value (i.e., is a minimum) if and only if w + \p = Q (i.e., w = —\p). 

UX-af = U^ 2 ^2aX + a 2 ) ^ J2X 2 -2aj:X + Na 2 ^ 2 J2X ^X 2 
{ ' N N N N N 

Comparing this last expression with (w 2 + pw + q), we have 

... 

Thus the expression is a minimum when a = ~\p = {J^X) /N = X, using the result of part (a). 



ABSOLUTE AND RELATIVE DISPERSION; COEFFICIENT OF VARIATION 

4.28 A manufacturer of television tubes has two types of tubes, A and B. Respectively, the tubes have 
mean lifetimes of X A = 1495 hours and X B = 1875 hours, and standard deviations of s A = 280 
hours and s B = 310 hours. Which tube has the greater (a) absolute dispersion and (b) relative 
dispersion? 
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(a) The absolute dispersion of A is s A = 280 hours, and of B is s B = 3 10 hours. Thus tube B has the greater 
absolute dispersion. 

(b) The coefficients of variation are 

Thus tube A has the greater relative 



4.29 Find the coefficients of variation, V, for the data of (a) Problem 4.14 and (b) Problem 4.18. using 
both uncorrected and corrected standard deviations. 

SOLUTION 

(a) F(uncorrected) = -<^^ d ) = |^ = 0.0433 = 4.3% 

F(corrected) = ^ cor *: cted ) = = . 0413 = AA% by p ro bl em 4.21(a) 



X 79.77 
^(corrected) _ 15.33 _ 
X ~ 79.77 ~ 



67.45 

l 

0.196= 19.6% 



by Problem 4.21 (ft) 



4.30 (a) Define a measure of relative dispersion that could be used for a set of data for which the 
quartiles are known. 

(b) Illustrate the calculation of the measure defined in part (a) by using the data of Problem 4.6. 



SOLUTION 

(a) If gi and Q 3 are given for a set of data, then \ (g, + g 3 ) 

average, while Q = \{Qi~Q\), the semi-interquartile range, is 
We can thus define a measure of relative dispersion as 

^(63-61) ^ 63-6! 
Q ^(61 + 63) 63 + 6! 
which we call the quartile coefficient of variation, or quurliie eoefj 
= 63 -6, ^ 69.61 - 65.64 ^ 3.97 
[ ' Q 63 + 61 69.61 +65.64 135.25 



of the data's central tendency, or 
of the data's dispersion. 



•oefficient of relative dispersion. 
0.0293 = 2.9% 



STANDARDIZED VARIABLE; STANDARD SCORES 



4.31 A student received a grade of 84 on a final examination in mathematics for which the mean grade 
was 76 and the standard deviation was 10. On the final examination in physics, for which the 
mean grade was 82 and the standard deviation was 16, she received a grade of 90. In which subject 
was her relative standing higher? 

SOLUTION 

The standardized variable z — (X - X)/s measures the deviation of X from the mean X in terms of 
standard deviation s. For mathematics, z = (84 - 76)/10 = 0.8; for physics, z = (90 - 82)/16 = 0.5. Thus 
the student had a grade 0.8 of a standard deviation above the mean in mathematics, but only 0.5 of a 
standard deviation above the mean in physics. Thus her relative standing was higher in mathematics. 

The variable z = (X - X) /s is often used in educational testing, where it is known as a standard score. 
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SOFTWARE AND MEASURES OF DISPERSION 

4.32 The STATISTIX analysis of the data in Example 3 of this chapter gave the following output. 
Statistix 8.0 
Descriptive Statistics 

Variable SD Variance C.V. MAD 

e-mails 29.256 855.93 44.562 21.000 

The MAD value is the median absolute deviation. It is the median value of the absolute 
differences among the individual values and the sample median. Confirm that the MAD value for this 
data equals 21. 

SOLUTION 

The sorted original data are: 



24 24 


24 


25 


26 


28 


29 


30 


31 


31 


31 


32 


32 




35 35 


36 


39 


40 


40 


42 


42 


44 


44 


45 


47 


49 




51 52 


54 


54 


54 


57 


58 


58 


58 


60 


61 


61 


63 




65 65 


68 


69 


70 


71 


71 


74 


74 


74 


77 


77 


77 




77 79 


80 


84 


86 


86 


95 


95 


99 


99 


100 


102 


102 




105 113 


113 


114 


115 


116 


118 


121 


122 


125 










The median of the original data is 61. 


















If 61 is subtracted from each value, the data become: 














-37 -37 


-37 


-36 


-35 


-33 


-32 


-31 


-30 


-30 


-30 


-29 


-29 




-26 -26 


-25 


-22 - 


-21 


-21 


-19 


-19 


-17 


-17 


-16 


-14 


-12 




-10 -9 


-7 


-7 


-7 


-4 


-3 


-3 


-3 


-1 








2 




4 4 


7 


8 


9 


10 


10 


13 


13 


13 


16 


16 


16 




16 18 


19 


23 


25 


25 


34 


34 


38 


38 


39 


41 


41 




44 52 


52 


53 


54 


55 


57 


60 


61 


64 










Now, take the absolute value of each of these values: 














37 37 


37 


36 35 


33 


32 


31 


30 


30 


30 


29 


29 


26 


26 


25 22 


21 


21 19 


19 


17 


17 


16 


14 


12 


10 


9 


7 


7 


7 4 


3 


3 3 


1 








2 


4 


4 


7 


8 


9 


10 


10 13 


13 


13 16 


16 


16 


16 


18 


19 


23 


25 


25 


34 


34 


38 38 


39 


41 41 


44 


52 


52 


53 


54 


55 


57 


60 


61 


64 



The median of this last set of data is 21. Therefore MAD = 21. 
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Supplementary Problems 

THE RANGE 

4.33 Find the range of the sets (a) 5, 3, 8, 4, 7, 6, 12, 4, 3 and (b) 8.772, 6.453, 10.624, 8.628, 9.434, 6.351. 

4.34 Find the range of the maximum loads given in Table 3.8 of Problem 3.59. 

4.35 Find the range of the rivet diameters in Table 3.10 of Problem 3.61. 

4.36 The largest of 50 measurements is 8.34 kilograms (kg). If the range is 0.46 kg, find the smallest measurement. 

4.37 The following table gives the number of weeks needed to find a job for 25 older workers that lost their jobs 
as a result of corporation downsizing. Find the range of the data. 

13 13 17 7 22 
22 26 17 13 14 
16 7 6 18 20 
10 17 11 10 15 
16 8 16 21 11 

THE MEAN DEVIATION 

4.38 Find the absolute values of (a) -18.2, (b) +3.58, (c) 6.21, (d) 0, (e) -s/l, and (/) 4.00 - 2.36 - 3.52. 

4.39 Find the mean deviation of the set (a) 3, 7, 9, 5 and (b) 2.4, 1.6, 3.8, 4.1, 3.4. 

4.40 Find the mean deviation of the sets of numbers in Problem 4.33. 

4.41 Find the mean deviation of the maximum loads in Table 3.8 of Problem 3.59. 

4.42 (a) Find the mean deviation (MD) of the rivet diameters in Table 3.10 of Problem 3.61. 

(b) What percentage of the rivet diameters lie between (X ± MD), (X ± 2MD), and (X ± 3 MD)? 

4.43 For the set 8, 10, 9, 12, 4, 8, 2, find the mean deviation (a) from the mean and (b) from the median. Verify 
that the mean deviation from the median is not greater than the mean deviation from the mean. 

4.44 For the distribution in Table 3.9 of Problem 3.60, find the mean deviation (a) about the mean and (b) about 
the median. Use the results of Problems 3.60 and 3.70. 

4.45 For the distribution in Table 3. 1 1 of Problem 3.62, find the mean deviation (a) about the mean and (b) about 
the median. Use the results of Problems 3.62 and 3.72. 
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4.46 Find the mean deviation for the data given in Problem 4.37. 

4.47 Derive coding formulas for computing the mean deviation (a) about the mean and (b) about the median 
from a frequency distribution. Apply these formulas to verify the results of Problems 4.44 and 4.45. 



THE SEMI-INTERQUARTILE RANGE 

4.48 Find the semi-interquartile range for the distributions of (a) Problem 3.59, (b) Problem 3.60, and (c) Problem 
3.107. Interpret the results clearly in each case. 

4.49 Find the semi-interquartile range for the data given in Problem 4.37. 

4.50 Prove that for any frequency distribution the total percentage of cases falling in the interval 
l(Gi + 63) ±2(23 - 20 is 50%. Is the same true for the interval Q 2 ±\{Qi - 20? Explain your answer. 

4.51 (a) How would you graph the semi-interquartile range corresponding to a given frequency distribution? 

(b) What is the relationship of the semi-interquartile range to the ogive of the distribution? 

THE 10-90 PERCENTILE RANGE 

4.52 Find the 10-90 percentile range for the distributions of (a) Problem 3.59 and (b) Problem 3. 107. Interpret the 
results clearly in each case. 

4.53 The tenth percentile for home selling prices in a city is $35,500 and the ninetieth percentile for home selling 
prices in the same city is $225,000. Find the 10-90 percentile range and give a range within which 80% 
of the selling prices fall. 

4.54 What advantages or disadvantages would a 20-80 percentile range have in comparison to a 10-90 percentile 
range? 

4.55 Answer Problem 4.51 with reference to the (a) 10-90 percentile range, (b) 20-80 percentile range, and 

(c) 25-75 percentile range. What is the relationship between (c) and the semi-interquartile range? 



THE STANDARD DEVIATION 

4.56 Find the standard deviation of the sets (a) 3, 6, 2, 1, 7, 5; (b) 3.2, 4.6, 2.8, 5.2, 4.4; and (c) 0, 0, 0, 0, 0, 1, 1, 1. 

4.57 (a) By adding 5 to each of the numbers in the set 3, 6, 2, 1, 7, 5, we obtain the set 8, 11, 7, 6, 12, 10. 

Show that the two sets have the same standard deviation but different means. How are the means 
related? 

(b) By multiplying each of the numbers 3, 6, 2, 1, 7, and 5 by 2 and then adding 5, we obtain the set 
1 1, 17, 9, 7, 19, 15. What is the relationship between the standard deviations and the means for the two 
sets? 

(c) What properties of the mean and standard deviation are illustrated by the particular sets of numbers in 
parts (a) and (/>)? 
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4.58 Find the standard deviation of the set of numbers in the arithmetic progression 4, 10, 16, 22, ... , 154. 

4.59 Find the standard deviation for the distributions of (a) Problem 3.59, (b) Problem 3.60, and (c) Problem 
3.107. 

4.60 Demonstrate the use of Charlier's check in each part of Problem 4.59. 

4.61 Find (a) the mean and (b) the standard deviation for the distribution of Problem 2.17, and explain the 
significance of the results obtained. 

4.62 When data have a bell-shaped distribution, the standard deviation may be approximated by dividing the 
range by 4. For the data given in Problem 4.37, compute the standard deviation and compare it with the 
range divided by 4. 

4.63 (a) Find the standard deviation s of the rivet diameters in Table 3.10 of Problem 3.61. 

(b) What percentage of the rivet diameters lies between X ± s, X ± 2s, and X ± 3.?? 

(c) Compare the percentages in part (b) with those which would theoretically be expected if the distribution 
were normal, and account for any observed differences. 

4.64 Apply Sheppard's correction to each standard deviation in Problem 4.59. In each case, discuss whether such 
application is or is not justified. 

4.65 What modifications occur in Problem 4.63 when Sheppard's correction is applied? 

4.66 (a) Find the mean and standard deviation for the data of Problem 2.8. 

(b) Construct a frequency distribution for the data and find the standard deviation. 

(c) Compare the results of part (b) with that of part (a). Determine whether an application of Sheppard's 
correction produces better results. 



4.67 Work Problem 4.66 for the data of Problem 2.27. 



4.68 (a) Of a total of N numbers, the fraction p are l's, while the fraction q = 1 - p are 0's. Prove that the 
standard deviation of the set of numbers is s /pq. 
(b) Apply the result of part (a) to Problem 4.56(c). 



4.69 (a) Prove that the variance of the set of n numbers a, a + d, a + 2d, . . . , a + (n - \)d (i.e., an arithmetic 

progression with the first term a and common difference d) is given by ^(w 2 — \)d 2 . 
(b) Use part (a) for Problem 4.58. [Hint: Use 1 + 2 + 3 • •• + («- 1) = \n{n - 1), l 2 + 2 2 + 3 2 + • • • + 
(»- l) 2 = !«(«- 1)(2«- 1).] 

4.70 Generalize and prove Property 3 of this chapter. 



EMPIRICAL RELATIONS BETWEEN MEASURES OF DISPERSION 



4.71 By comparing the standard deviations obtained in Problem 4.59 with the corresponding mean deviations 
of Problems 4.41, 4.42, and 4.44, determine whether the following empirical relation holds: Mean 
deviation = \ (standard deviation). Account for any differences that may occur. 
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4.72 By comparing the standard deviations obtained in Problem 4.59 with the corresponding semi-interquartile 
ranges of Problem 4.48, determine whether the following empirical relation holds: Semi-interquartile 
range = | (standard deviation). Account for any differences that may occur. 

4.73 What empirical relation would you expect to exist between the semi-interquariilc range and the mean 
deviation for bell-shaped distributions that are moderately skewed? 

4.74 A frequency distribution that is approximately normal has a semi-interquartile range equal to 10. What 
values would you expect for (a) the standard deviation and (b) the mean deviation? 



ABSOLUTE AND RELATIVE DISPERSION; COEFFICIENT OF VARIATION 

4.75 On a final examination in statistics, the mean grade of a group of 150 students was 78 and the standard 
deviation was 8.0. In algebra, however, the mean final grade of the group was 73 and the standard deviation 
was 7.6. In which subject was there the greater (a) absolute dispersion and (b) relative dispersion? 

4.76 Find the coefficient of variation for the data of (a) Problem 3.59 and (b) Problem 3.107. 

4.77 The distribution of SAT scores for a group of high school students has a first quartile score equal to 825 and 
a third quartile score equal to 1125. Calculate the quartile coefficient of variation for the distribution of 
SAT scores for this group of high school students. 

4.78 For the age group 15-24 years, the first quartile of household incomes is equal to $16,500 and the third 
quartile of household incomes for this same age group is $25,000. Calculate the quartile coefficient of 
variation for the distribution of incomes for this age group. 



STANDARDIZED VARIABLES; STANDARD SCORES 

4.79 On the examinations referred to in Problem 4.75, a student scored 75 in statistics and 71 in algebra. In which 
examination was his relative standing higher? 

4.80 Convert the set 6, 2, 8, 7, 5 into standard scores. 

4.81 Prove that the mean and standard deviation of a set of standard scores are equal to and 1, respectively. 
Use Problem 4.80 to illustrate this. 

4.82 (a) Convert the grades of Problem 3. 107 into standard scores, and (b) construct a graph of relative frequency 
versus standard score. 



SOFTWARE AND MEASURES OF DISPERSION 



4.83 Table 4.1 1 gives the per capita income for the 50 states in 2005. 
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Table 4.11 Per Capita Income for the 50 States 



State 


Per Capita Income 


State 


Per Capita Income 


Wyoming 


36,778 


Pennsylvania 


34,897 


Montana 


29,387 


Wisconsin 


33,565 


North Dakota 


31,395 


Massachusetts 


44,289 


New Mexico 


27,664 


Missouri 


31,899 


West Virginia 


27,215 


Idaho 


28,158 


Rhode Island 


36,153 


Kentucky 


28,513 


Virginia 


38,390 


Minnesota 


37,373 


South Dakota 


31,614 


Florida 


33,219 


Alabama 


29,136 


South Carolina 


28,352 


Arkansas 


26,874 


New York 


40,507 


Maryland 


41,760 


Indiana 


31,276 


Iowa 


32,315 


Connecticut 


47,819 


Nebraska 


33,616 


Ohio 


32,478 


Hawaii 


34,539 


New Hampshire 


38,408 


Mississippi 


25,318 


Texas 


32,462 


Vermont 


33,327 


Oregon 


32,103 


Maine 


31,252 


New Jersey 


43,771 


Oklahoma 


29,330 


California 


37,036 


Delaware 




Colorado 




Alaska 


35,612 


North Carolina 


30,553 


Tennessee 


31,107 


Illinois 


36,120 


Kansas 


32,836 


Michigan 


33,116 


Arizona 


30,267 


Washington 


35,409 


Nevada 


35,883 


Georgia 


31,121 


Utah 


28,061 


Louisiana 


24,820 



The SPSS analysis of the data is as follows: 

Descriptive Statistics 





N 


Range 


Std. Deviation 


Variance 


Income 


50 


22999.00 


4893.54160 


2E + 007 


Valid N (listwise) 


50 









Verify the range, standard deviation, and variance. 



Moments, Skewness, 
and Kurtosis 



MOMENTS 

If Xj , X 2 , ■ ■ ■ , X N are the TV values assumed by the variable X, we define the quantity 



E*7 



N 



called the rth moment. The first moment with r — 1 is the arithmetic mean X. 
The /-ill moment about the mean X is defined as 



If r — 1, then m, = (see Problem 3.16). If r = 2, then m 2 = s , the variance. 
The rth moment about any origin A is defined as 



Y,( x j- a y 



where d = X — A are the deviations of X from A.lf A — 0, equation (5) reduces to equation (7). For this 
reason, equation (7) is often called the rth moment about zero. 



MOMENTS FOR GROUPED DATA 

If X\, X 2 , . . . ,X K occur with frequencies f\,f 2 , ■ ■ ■ ,/k, respectively, the above moments are given by 

K 

E^ 

YT J\X[+f 2 X , 2 + ---+f K X' K ^ j ^ - 



Copyright © 2008, 1999, 1988, 1961 by The McGraw-Hill Companies, Inc. Click here for terms of u: 
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MOMENTS, SKEWNESS, AND KURTOSIS 
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e f( x - xy _ 



U _j:f(x-A) r _ 



(x-xy 



(X - A) r 



where N = J2f=i fj = E/- Tne formulas are suitable for calculating moments from grouped data. 

RELATIONS BETWEEN MOMENTS 

The following relations exist between moments about the mean m r and moments about an arbitrary 
origin m' r : 



m 3 = tn'i — 3m[m 2 + 2m'\ (7) 
m 4 = m\ — 4m[m^ + bm'^m^ — 3m[ A 
etc. (see Problem 5.5). Note that m[ = X - A. 

COMPUTATION OF MOMENTS FOR GROUPED DATA 

The coding method given in previous chapters for computing the mean and standard deviation can 
also be used to provide a short method for computing moments. This method uses the fact that 
Xj = A + cuj (or briefly, X = A + cu), so that from equation (6) we have 

ml = (8) 
which can be used to find m r by applying equations (7). 

CHARLIER'S CHECK AND SHEPPARD'S CORRECTIONS 

Charlier's check in computing moments by the coding method uses the identities: 

Ef(u+l)=Efu + N 

E/("+i) 2 -E./« 2 + 2E/« + ^ 

(9) 

E fiM + 1 ) 3 = E fu + 3 E fi' 2 + 3 E fit + n 

E /(« +i) 4 -E./" 4 + 4 E fi' 3 + 6 E fu 2 + 4 E fu + n 

Sheppard's corrections for moments are as follows: 

Corrected m 2 = m 2 - ^ c 2 Corrected m A = m A - \ c 2 m 2 + ^ c 4 
The moments m l and m 3 need no correction. 

MOMENTS IN DIMENSIONLESS FORM 

To avoid particular units, we can define the dimension/ess moments about the mean as 



where s = ^/m~i is the standard deviation. Since m, = and m 2 = s 2 , we have a { = and a 2 = 1. 



(10) 



MOMENTS, SKEWNESS, AND KURTOSIS 



SKEWNESS 

Skewness is the degree of asymmetry, or departure from symmetry, of a distribution. If the frequency 
curve (smoothed frequency polygon) of a distribution has a longer tail to the right of the central 
maximum than to the left, the distribution is said to be skewed to the right, or to have positive skewness. 
If the reverse is true, it is said to be skewed to the left, or to have negative skewness. 

For skewed distributions, the mean tends to lie on the same side of the mode as the longer tail 
(see Figs. 3-1 and 3-2). Thus a measure of the asymmetry is supplied by the difference: mean-mode. 
This can be made dimensionless if we divide it by a measure of dispersion, such as the standard deviation, 
leading to the definition 

mean — mode X — mode . . 

Skewness = — — — : — : — = (77) 

standard deviation s 

To avoid using the mode, we can employ the empirical formula (70) of Chapter 3 and define 

3 (mean - median) 3(X — median) , , 

Skewness = — — ; — , , . . = (12) 

standard deviation s 

Equations (77) and (72) are called, respectively, Pearson's first and second coefficients of skewness. 
Other measures of skewness, defined in terms of quartiles and percentiles, are as follows: 

Quartile coefficient of skewness = (& ~ 62) - (62 - 60 = 83 ~ 2g 2 + gi {}3) 
63-61 63-61 

10-90 percentile coefficient of skewness = (Ao - - g 50 ~ P ' o) = P9Q " 2P5Q + P '° (14) 

A>o - "10 "90 - "10 

An important measure of skewness uses the third moment about the mean expressed in dimension- 
less form and is given by 

Moment coefficient of skewness = a^ — = — = (15) 

Another measure of skewness is sometimes given by b\ = a\. For perfectly symmetrical curves, 
such as the normal curve, a 3 and b\ are zero. 

KURTOSIS 

Kurtosis is the degree of peakedness of a distribution, usually taken relative to a normal distribution. 
A distribution having a relatively high peak is called leptokurtic, while one which is flat-topped is called 
platykurtic. A normal distribution, which is not very peaked or very flat-topped, is called mesokurtic. 

One measure of kurtosis uses the fourth moment about the mean expressed in dimensionless form 
and is given by 

Moment coefficient of kurtosis = a 4 = ^ = ^ (16) 

which is often denoted by h 2 . For the normal distribution, b 2 — a A — 3. For this reason, the kurtosis is 
sometimes defined by (h 2 — 3), which is positive for a leptokurtic distribution, negative for a platykurtic 
distribution, and zero for the normal distribution. 

Another measure of kurtosis is based on both quartiles and percentiles and is given by 

-A 

where Q = j(Q 3 - 61) is the semi-interquartile range. We refer to k (the lowercase Greek letter kappa) 
as the percentile coefficient of kurtosis; for the normal distribution, n has the value 0.263. 
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POPULATION MOMENTS, SKEWNESS, AND KURTOSIS 

When it is necessary to distinguish a sample's moments, measures of skewness, and measures of 
kurtosis from those corresponding to a population of which the sample is a part, it is often the custom to 
use Latin symbols for the former and Greek symbols for the latter. Thus if the sample's moments are 
denoted by m r and m' r , the corresponding Greek symbols would be /x r and \x r (p, is the Greek letter mu). 
Subscripts are always denoted by Latin symbols. 

Similarly, if the sample's measures of skewness and kurtosis are denoted by a 3 and a 4 , respectively, 
the population's skewness and kurtosis would be a 3 and a 4 (a is the Greek letter alpha). 

We already know from Chapter 4 that the standard deviation of a sample and of a population are 
denoted by s and a, respectively. 



SOFTWARE COMPUTATION OF SKEWNESS AND KURTOSIS 

The software that we have discussed in the text so far may be used to compute skewness and kurtosis 
measures for sample data. The data in Table 5.1 samples of size 50 from a normal distribution, 
a skewed-right distribution, a skewed-left distribution, and a uniform distribution. 

The normal data are female height measurements, the skewed-right data are age at marriage for 
females, the skewed-left data are obituary data that give the age at death for females, and the uniform 
data are the amount of cola put into a 12 ounce container by a soft drinks machine. The distribution of 



Table 5.1 



Normal 


Skewed-right 


Skewed-left 


Uniform 


67 


69 


31 


40 


102 


87 


12.1 


11.6 


70 


62 


43 


24 


55 


104 


12.1 


11.6 


63 


67 


30 


29 


70 


75 


12.4 


12.0 


65 


59 


30 


24 


95 


80 


12.1 


11.6 


68 


66 


38 


27 


73 


66 


12.1 


11.6 


60 


65 


26 


35 


79 


93 


12.2 


11.7 


70 


63 


29 


33 


60 


90 


12.2 


12.3 


64 


65 


55 


75 


73 


84 


12.2 


11.7 


69 


60 


46 


38 


89 


73 


11.9 


11.7 


61 


67 


26 


34 


85 


98 


12.2 


11.7 


66 


64 


29 


85 


72 


79 


12.3 


11.8 


65 


68 


57 


29 


92 


35 


12.3 


12.5 


71 


61 


34 


40 


76 


71 


11.7 


11.8 


62 


69 


34 


41 


93 


90 


12.3 


11.8 


66 


65 


36 


35 


76 


71 


12.3 


11.8 


68 


62 


40 


26 


97 


63 


12.4 


11.9 


64 


67 


28 


34 


10 


58 


12.4 


11.9 


67 


70 


26 


19 


70 


82 


12.1 


11.9 


62 


64 


66 


23 


85 


72 


12.4 


12.2 


66 


63 


63 


28 


25 


93 


12.4 


11.9 


65 


68 


30 


26 


83 


44 


12.5 


12.0 


63 


64 


33 


31 


58 


65 


11.8 


11.9 


66 


65 


24 


25 


10 


77 


12.5 


12.0 


65 


61 


35 


22 


92 




12.5 


12.0 


63 


66 


34 


28 


82 


77 


12.5 


12.0 
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the four sets of sample data is shown in Fig. 5-1. The distributions of the four samples are illustrated by 
dotplots. 



Dotplot of Height, Wedage, Obitage, Cola-fill 



Height 


■ ■ ■ 1 1 


1 1 1 


1 1 ■ ■ ■ 


60 62 


64 66 


68 70 


Wedage 


. ..i.i.iiii tii. 1 1. . . .... 


18 27 36 


45 54 


63 72 81 


Obitage 


» , • , • « 




.tH.H«».l»t.« .. 


14 28 42 


56 


70 84 98 


Cola-fill 


■ III 


— 1 1 


till 



11.6 11.8 12.0 12.2 12.4 

Fig. 5-1 MINITAB plot of four distributions: normal, right-skewed, left-skewed, and uniform. 



The variable Height is the height of 50 adult females, the variable Wedage is the age at the wedding 
of 50 females, the variable Obitage is the age at death of 50 females, and the variable Cola-fill is the 
amount of cola put into 12 ounce containers. Each sample is of size 50. Using the terminology that we 
have learned in this chapter: the height distribution is mesokurtic, the cola-fill distribution is platykurtic, 
the wedage distribution is skewed to the right, and the obitage distribution is skewed to the left. 

EXAMPLE 1 . If MINITAB is used to find skewness and kurtosis values for the four variables, the following results 
are obtained by using the pull-down menu "Stat =>• Basic statistics => Display descriptive statistics." 

Descriptive Statistics: Height, Wedage, Obitage, Cola-fill 

Variable N N* Mean StDev Skewness Kurtosis 



Height 50 65.120 2.911 -0.02 -0.61 

Wedage 50 35.48 13.51 1.98 4.10 

Obitage 50 74.20 20.70 -1.50 2.64 

Cola-fill 50 12.056 0.284 0.02 -1.19 



The skewness values for the uniform and normal distributions are seen to be near 0. The skewness measure is 
positive for a distribution that is skewed to the right and negative for a distribution that is skewed to the left. 

EXAMPLE 2. Use EXCEL to find the skewness and kurtosis values for the data in Fig. 5-1. If the variable 
names are entered into A1:D1 and the sample data are entered into A2:D51 and any open cell is used to 
enter =SKEW(A2:A51) the value -0.0203 is returned. The function =SKEW(B2:B51) returns 1.9774, 
the function =SKEW(C2:C51) returns -1.4986, and =SKEW(D2:D51) returns 0.0156. The kurtosis values are 
obtained by =KURT(A2:A51) which gives -0.6083, =KURT(B2:B51) which gives 4.0985, =KURT(C2:C51) 
which gives 2.6368, and =KURT(D2:D51) which gives -1.1889. It can be seen that MINITAB and EXCEL give 
the same values for kurtosis and skewness. 



EXAMPLE 3. When STATISTIX is used to analyze the data shown in Fig. 5-1, the pull-down "Statistics => 
Summary Statistics => Descriptive Statistics" gives the dialog box in Fig. 5-2. 
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P N 


r c.v. 


r Missing 


r Median 


r Sun 


r Miti/man 


|7 Mean 


T Quarliles 


17 ED 


r MAD 


T Variance 




T SE Mean 


17 Skew 


T Corf. Int. 





Fig. 5-2 Dialogue 

Note that N, Mean, SD, Skew, and Kurtosis a 
output is as follows. 

Descriptive Statistics 

Variable 

Cola 
Height 
Obitage 
Wedage 



30x for STATISTIX. 

; checked as the statistics to report. The STATISTIX 



Mean 

12.056 
65.120 
74.200 
35.480 



SD 

0.2837 
2.9112 
20.696 
13.511 



Skew 

0.0151 
-0.0197 
-1.4533 
1.9176 



Kurtosis 

-1.1910 
-0.6668 
2.2628 
3 . 5823 



Since the numerical values differ slightly from M1N1TAB and EXCEL, it is obvious that slightly 
different measures of skewness and kurtosis are used by the software. 



EXAMPLE 4. The SPSS pull-down menu "Analyze => Descriptive Statistics => Descriptives" gives the dialog box in 
Fig. 5-3 and the routines Mean, Std. deviation, Kurtosis, and Skewness are chosen. SPSS gives the same measures of 
skewness and kurtosis as EXCEL and MINITAB. 



0Mean DSum \ Continue J 

Dispersi0n I Cance , ] 

Std. deviation □Minimum 



□ Variance 

□ Range 

Distribution 
Kurtosis 0! 



□ Maximum 

□ S.E. mean 



Display Order 

©Variable list 

O Alphabetic 

Ascending means 
Descending mean:; 



Fig. 5-3 Dialogue box for SPSS. 



MOMENTS, SKEWNESS, AND KURTOSIS 



The following SPSS output is given. 



Descriptive Statistics 





N 


Mean 


Std. 


Skewness 


Kurtosis 


Statistic 


Statistic 


Statistic 


Statistic 


Std. Error 


Statistic 


Std. Error 


Height 


50 


65.1200 


2.91120 


-.020 


.337 


-.608 


.662 


Wedage 


50 


35.4800 


13.51075 


1.977 


.337 


4.098 


.662 


Obitage 


50 


74.2000 


20.69605 


-1.499 


.337 


2.637 


.662 


Colafill 


50 


12.0560 


.28368 


.016 


.337 


-1.189 


.662 


Valid N 


50 














(listwise) 

















EXAMPLE 5. When SAS is used to compute the skewness and kurtosis values, the following output results. 
It is basically the same as that given by EXCEL, MINITAB, and SPSS. 



The MEANS Procedure 
Variable Mean Std Dev N Skewness Kurtosis 

ff ff f ff ff ff ff f fff f ff ff f ff ff ff ff f ff ff ff ff f ff ff ff ff f ff ff ff ff f ff ff ff ff f fff 
Height 65.1200000 2.9112029 50 -0.0203232 -0.6083437 

Wedage 35.4800000 13.5107516 50 1.9774237 4.0984607 

Obitage 74.2000000 20.6960511 50 -1.4986145 2.6368045 

Cola_fill 12.0560000 0.2836785 50 0.0156088 -1.1889600 

ff ff f fff f fff f ff ff ff ff f ff f ff ff ff f ff ff ff ff f ff ff ff ff f ff ff ff ff f ff ff ff ff f ff f 



Solved Problems 

MOMENTS 

5.1 Find the (a) first, (b) second, (c) third, and (d) fourth moments of the set 2, 3, 7, 8, 10. 
SOLUTION 

(a) The first moment, or arithmetic mean, is 

- £ x 2 + 3 + 7 + 8+ 10 30 , 

X = H^ = 5 = T = 6 



(b) The second moment is 



_ J2 X 2 _ 2 2 + 3 2 + 7 2 + 8 2 + 10 2 _ 226 _ 



(c) The third moment is 

— _ J2 ^ _ 2 3 + 3 3 + 7 3 + 8 3 + 10 3 _ 1890 _ 

(d) The fourth moment is 
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_E(^- 


xf _ 


(2- 


6) 2 +(3 


-6) 2 + (7 


-6) 2 + 


3 - 6) 2 + (10 


-6) 2 


46 


N 










6 






~ 5 ~ 


the variai 


ice .s 2 . 


















X? 


(2- 


6) 3 + (3 


-6) 3 + (7- 


-6) 3 + 


i - 6) 3 + (10 


- 6) 3 


-18 


N 

_E(^- 


xf _ 


(2- 


6) 4 +(3 


-6) 4 + (7 


5 

-6) 4 + (: 


S-6) 4 + (10 


-6) 4 


5 

610 



5.2 Find the (a) first, (ft) second, (c) third, and (J) fourth moments about the mean for the set of 
numbers in Problem 5.1. 

SOLUTION 

M » - T^^S _ E ~ _ (2 - 6) + (3 - 6) + (7 - 6) + (8 - 6) + (10 - 6) _ _ 

m x is always equal to zero since X-X = X- X = (see Problem 3.16). 
(ft) m 2 = (X-X) 2 : 

Note that m 2 is 

(c) ^ 3 = (X-Z) 3 : 

(d) ™ 4 = (X-Z) 4 : 

5.3 Find the (a) first, (ft) second, (c) third, and (d) fourth moments about the origin 4 for the set of 
numbers in Problem 5.1. 

SOLUTION 

(a) m[=JX^A) 

(ft) B j = (^= ^p = ^~^ ^A±^ ZAi^ -rv^-t; = ^ = i 3 .2 

(c) m ;-(^7 = ^^~ 4 ) 3 ^ ( 2 ~ 4 ) 3 + ( 3 ~ 4 ) 3 + ( 7 ~ 4 ) 3 + ( 8 ~ 4 ) 3 + ( 10 ~ 4 ) 3 = 298 = 59.6 

^ = 330 

5.4 Using the results of Problems 5.2 and 5.3, verify the relations between the moments 
(a) m 2 = m 2 — m' 2 , (ft) m 3 = — 3m' 1 m 2 + 2m' 3 , and (c) m 4 = m\ — 4m[m^ + 6m' 2 m 2 — 3m{ 4 . 

SOLUTION 

From Problem 5.3 we have m[ = 2, m 2 = 13.2, m'^ = 59.6, and m\ = 330. Thus: 
(a) m 2 =m 2 - m[ 2 = 13.2 - (2) 2 = 13.2 - 4 = 9.2 

(ft) m 3 = rn'i - lm[m 2 + 2m[ 3 = 59.6 - (3)(2)(13.2) + 2(2) 3 = 59.6 - 79.2 + 16 = -3.6 
(c) m 4 = mi - 4m[m^ + 6m[ 2 m 2 - 3mJ 4 = 330 - 4(2)(59.6) + 6(2) 2 (13.2) - 3(2) 4 = 122 
it with Problem 5.2. 





4) 


(2- 


4) + (3 - 4) + (7 - 4) + (8 - 


4) + (10 -4) 


2 


N 






5 








-4) 2 


(2 


- 4) 2 + (3 - 4) 2 + (7 - 4) 2 + 


(8-4) 2 + (10- 


-4) 2 


N 






5 






_E(^- 


-4) 3 


_(2 


-4) 3 + (3-4) 3 + (7-4) 3 + 


(8-4) 3 + (10- 


4) 3 


N 






5 






■ 


-4) 4 


(2 


- 4) 4 + (3 - 4) 4 + (7 - 4) 4 + 


(8-4) 4 + (10- 


■4) 4 


N 






5 







5.5 Prove that (a) m 2 = m' 2 — m' 2 , (ft) m 3 = — 3m[m 2 + 2m' 3 , and (c) m A = m\ — 4m[m 
6m' 2 m 2 - 3m[ 4 . 



SOLUTION 

If d = X - A, then X = A + d, X = A + d, and X - X = 



m 2 = (X - X) 1 = {d- df = d 2 - 2dd + d 2 
= d 2 - 2d 2 + d 2 = d 2 - d 2 = m 2 - m{ 2 
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m 3 = (X- Xf = {d- df = (d 3 - 3d 2 d + 3dd 2 - d 3 ) 

= cP- 3d! 2 + 3d 3 - d 3 = d 3 - 3M + 2d 3 = m 3 - 3m[m 2 + 2 

m 4 = {X - X) A = {d - df = {d 4 - 4d 3 d + 6d 2 d 2 - Add 3 + d 4 ) 
= d 4 - Add 3 + bd 2 ! 2 _ 4 d 4 + d 4 - Ad! 3 + dd 2 ! 2 - 3d 4 
= m\ — 4m[m 3 + 6m' 2 m 2 — 3m' 4 
By extension of this method, we can derive similar results for m 5 , m h . etc. 



COMPUTATION OF MOMENTS FROM GROUPED DATA 

5.6 Find the first four moments about the mean for the height distribution of Problem 3.22. 
SOLUTION 

The work can be arranged as in Table 5.2, from which we have 



m 2 = m' 2 - m' 2 = 8.73 - (0.45) 2 = 8.5275 

m 3 = m[ - 3m[m' 2 + m[ 3 = 8.91 - 3(0.45)(8.73) + 2(0.45) 3 = -2.6932 
m 4 = m[ — 4m[m 3 + 6m' 2 m 2 — 3m' 4 

= 204.93 - 4(0.45)(8.91) + 6(0.45) 2 (8.73) - 3(0.45) 4 = 199.3759 



Table 5.2 



X 




/ 


fit 


fu 2 




/" 4 


61 


-2 


5 


-10 


20 


-40 


80 


64 


-1 


18 


-18 


18 


-18 


18 


67 





42 














70 


1 


27 


27 


27 


27 


27 


73 


2 


8 


16 


32 


64 


128 




AT = £/=10 


E/«=i5 


E/» 2 = 97 


E/" 3 = 33 


E/" 4 = 253 



5.7 Find (a) m[, (b) m' 2 , (c) m^, (d) m' A , (e) m u (f) m 2 , (g) m 3 , (h) m 4 , (i) X, (./) s, (k) X 2 , and (/) X 3 
for the distribution in Table 4.7 of Problem 4.19. 



SOLUTION 



The work can be arranged as in Table 5.3. 
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Table 5.3 



X 


u 


/ 


/" 


fu 2 


fa 3 


fa 4 


70 


~ 6 


4 


-24 


144 


-864 


5184 


74 


~ 5 


9 


-45 


225 


-1125 


5625 


78 


~ 4 


16 


-64 


256 


-1024 


4096 


82 


~ 3 


28 


-84 


252 


-756 


2268 


86 


~ 2 


45 


-90 


180 


-360 


720 




~* 












-> 94 




85 


6 Q 


6 () 


~ 6< o 


6 () 


98 


1 


72 


72 


72 


72 


72 


102 




54 


108 


216 


432 


864 


106 


3 


38 


114 


342 


1026 


3078 


110 


4 


27 


108 


432 


1728 


6912 


114 


5 


18 


90 


450 


2250 


11250 


118 


6 


11 


66 


396 


2376 


14256 


122 


7 


5 


34 


245 


1715 


12005 


126 


8 




16 


128 


1024 


8192 




N = E / = 480 


E/" = 236 


E/" 2 = 3404 


E fu 3 = 6428 


E fu 4 = 74,588 



(a) 


m[ 


_ c E/«_ 

N 




1.9667 


(b) 


m'l 


_ C 2E/" 2 

N 


= (4) 2 


/3404 

Uso 


) = 113.4667 


(c) 


m'-i 


N 


= (4) 3 


/6428 N 
1,480, 


= 857.0667 


id) 
(e) 




_aT.Ju 4 


= (4) 4 


/74,58 


= 39,780.2667 




N 

= 


V 480 





(f) m 1 =m 1 - m[ 2 = 1 13.4667 - (1.9667) 2 = 109.5988 

(g) m 3 =m- i - 3m[m' 2 + 2m[ 3 = 857.0667 - 3(1.9667)(1 13.4667) + 2(1.9667) 3 = 202.8158 

(h) m 4 =ni4- 4m[mj + (sm^m^ - 3w[ 4 = 35,627.2853 

(0 X = (A + d) =A + m[=A + = 94 + 1.9667 = 95.97 

(J) s = jmi = V109.5988 = 10.47 

(k) X 2 = {A + d) 2 = (A 2 + 2Ad + d 1 ) = A 2 + 2Ad + d 2 = A 2 + 2Am[ + m' 2 

= (94) 2 + 2(94)(1.9667) + 113.4667 = 9319.2063, or 9319 to four significant figures 

(/) X 3 = {A + df = (A 3 + 3A 2 d + 3Ad 2 + d y ) = A 3 + 3A 2 d + 3Ad 2 + d 3 

= A 3 + 3A 2 m[ + 3Am' 2 + = 915,571.9597, or 915,600 to four significant figures 
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CHARLIER'S CHECK 

5.8 Illustrate the use of Owner's check for the computations in Problem 5.7. 
SOLUTION 

To supply the required check, we add to Table 5.3 the columns shown in Table 5.4 (with the exception 
of column 2, which is repeated in Table 5.3 for convenience). 

In each of the following groupings, the first is taken from Table 5.4 and the second is taken from 
Table 5.2. Equality of results in each grouping provides the required check. 



Table 5.4 



u+\ 


/ 


/("+!) 


7>+l) 2 


f(u+\f 


7>+l) 4 


-5 


4 


-20 


100 


-500 


2500 


-4 


9 


-36 


144 


-576 


2304 


-3 


16 


-48 


144 


-432 


1296 


-2 


28 


-56 


112 


-224 


448 


-1 


45 


-45 


45 


-45 


45 





66 














1 


85 


85 


85 


85 


85 


2 


72 


144 


288 


576 


1152 


3 


54 


162 


486 


1458 


4374 


4 


38 


152 


608 


2432 


9728 


5 


27 


135 


675 


3375 


16875 


6 


18 


108 


648 


3888 


23328 


7 


11 


77 


539 


3773 


26411 


8 


5 


40 


320 


2560 


20480 


9 


2 


18 


162 


1458 


13122 




N = Ef 


£/(«+!) 


£./>+i) 2 


£/>+i) 3 


E/("+i) 4 




= 480 


= 716 


= 4356 


= 17,828 


= 122,148 



£/(«+!) = 716 

Y, fu + N = 236 + 480 = 716 

E7>+ 1) 2 = 4356 

J2fu 2 + 2J2fi' + N= 3404 + 2(236) + 480 = 4356 
£/(«+ l) 3 = 17,828 

E/« 3 + 3E/" 2 + 3E/" + W= 6428 + 3(3404) + 3(236) + 480 = 17,828 
E/(«+l) 4 = 122,148 

Y, fu 4 + 4Y,fi' 3 + 6Y,fu 2 + 4Y,fi' + N= 74,588 + 4(6428) + 6(3404) + 4(236) + 480 = 122,148 
SHEPPARD'S CORRECTIONS FOR MOMENTS 

5.9 Apply Sheppard's corrections to determine the moments about the mean for the data in 
(a) Problem 5.6 and (b) Problem 5.7. 
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SOLUTION 

(a) Corrected m 2 = m 2 - c 2 /l2 = 8.5275 - 3 2 /12 = 7.7775 
Corrected m 4 = m 4 - \ c 2 m 2 + ^ c 4 

= 199.3759 - \ (3) 2 (8.5275) + ^ (3) 4 
= 163.3646 
m x and m 2 need no correction. 

(b) Corrected m 2 = m 2 - c 2 /l2 = 109.5988 - 4 2 /12 = 108.2655 
Corrected m 4 = m A - \ c 2 m 2 + ^ c 4 

= 35,627.2853 - \ (4) 2 ( 109.5988) + ^ (4) 4 
= 34,757.9616 



5.10 Find Pearson's (a) first and (b) second coefficients of skewness for the wage distribution of the 
65 employees at the P&R Company (see Problems 3.44 and 4.18). 

SOLUTION 

Mean = $279.76, median = $279.06, mode = $277.50, and standard deviation s = $15.60. Thus: 

(a) First coefficient of skewness =■ ^ — $15 60 

... _ , _. t ,. 3 (mean - median) 3($279.76 - $279.06) 

(b) Second coefficient of skewness = — = — — — = 0.1346, or 0.13 

If the corrected standard deviation is used [see Problem 4.21(6)], these coefficients become, respectively: 

, , Mean - mode $279.76 - $277.50 

(a) Corrected. = $T^3l = 0-1474, or 0.15 

3(mean - median) = 3($279.76 - $279.06) = 13?Q 
Corrected s $15.33 

Since the coefficients are positive, the distribution is skewed positively (i.e., to the right). 

5.11 Find the (a) quartile and (b) percentile coefficients of skewness for the distribution of Problem 
5.10 (see Problem 3.44). 

SOLUTION 

g, = $268.25, Q 2 = P 50 = $279.06, g 3 = $290.75, P i0 = D l = $258.12, andP 90 = D 9 = $301.00. Thus: 



_ Qi ' 2Qi + Qi _ $290.75 - 2($279.06) + $268.25 
$290.75 - $268.25 



(b) Percentile coefficient of skewness = 



0.0391 

P 90 - IP 50 + P, _ $301.00 - 2($279.06) + $258.12 _ 



$301.00 -$258.12 



5.12 Find the moment coefficient of skewness, a 3 , for (a) the height distribution of students at XYZ 
University (see Problem 5.6) and (b) the IQ's of elementary school children (see Problem 5.7). 
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SOLUTION 

(a) m 2 = s 2 = 8.5275, and m 3 = -2.6932. Thus: 

m = m^ = -2£32 = _ Q1081 ^ _ Q 

s 3 (V^) 3 (x/^5275) 3 

If Sheppard's corrections for grouping are used [see Problem 5.9(a)], then 

™3 -2.6932 _„„ 

Corrected a, = J = , = -0. 1242 or 

(^/corrected m 2 ) 3 (V7.7775) 3 

m 3 ot, 202.8158 



s 3 (y^) 3 (a/109.5988) 3 

If Sheppard's corrections for grouping are used [see Problem 5.9(6)], then 

Corrected a 3 = , w 3 ^ ^^ = 202 .8158 = 18Q0 Qr o. 18 
(^/corrected m 2 ) 3 (V108.2655) 3 

Note that both distributions are moderately skewed, distribution (a) to the left (negatively) and 
distribution (b) to the right (positively). Distribution (b) is more skewed than (a); that is, (a) is more 
symmetrical than (b), as is evidenced by the fact that the numerical value (or absolute value) of the 
skewness coefficient for (b) is greater than that for (a). 



5.13 Find the moment coefficient of kurtosis, a 4 , for the data of (a) Problem 5.6 and (b) Problem 5.7. 
SOLUTION 

m A m 4 199.3759 

(a) a 4 = ^ = ^ = ^ = 2.7418 or 2.74 

/ m\ (8.5275) 2 

If Sheppard's corrections are used [see Problem 5.9(a)], then 

corrected nn 163.36346 

Corrected a 4 = 4 — = r = 2.7007 or 2.70 

(corrected m 2 f (7.7775) 3 

I = 2.9660 or 2.97 



" s 4 m 2 2 (109.5988) 2 
If Sheppard's corrections are used [see Problem 5.9(6)], then 

corrected m 4 34,757.9616 

Corrected a A = ~ = T = 2.9653 or 2.97 

(corrected m 2 f (108.2655) 2 

Since for a normal distribution a 4 = 3, it follows that both distributions (a) and (b) are platykurlic with 
respect to the normal distribution (i.e., less peaked than the normal distribution). 

Insofar as peakedness is concerned, distribution (/>) approximates the normal distribution much better 
than does distribution (a). However, from Problem 5.12 distribution (a) is more symmetrical than (b), so that 
as far as symmetry is concerned, (a) approximates the normal distribution better than (b) does. 

SOFTWARE COMPUTATION OF SKEWNESS AND KURTOSIS 

5.14 Sometimes the scores on a test do not follow the normal distribution, although they usually do. 
We sometimes find that students score low or high with very few in the middle. The distribution 
shown in Fig. 5-4 is such a distribution. This distribution is referred to as the U distribution. 
Find the mean, standard deviation, skewness, and kurtosis for the data using EXCEL. 
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Dotplot of scores 



Fig. 5-4 .A MINITAB plot of data that follows a U distribution. 



The data is entered into A1:A30 of an EXCEL worksheet. The command "=AVERAGE(A1:A30)" gives 50. 
The command "=STDEV(A1:A30)" gives 29.94. The command "=SKEW(A1:A30)" gives 0. The command 
"=KURT(A1:A30)" gives -1.59. 



Supplementary Problems 



MOMENTS 

5.15 Find the (a) first, (b) second, (c) third, and (d) fourth moments for the set 4, 7, 5, 9, 8, 3, 6. 

5.16 Find the (a) first, (b) second, (c) third, and (d) fourth moments about the mean for the set of numbers 
in Problem 5.15. 



5.17 Find the (a) first, (b) second, (c) third, and (d) fourth moments about the number 7 for the set of numbers 
in Problem 5.15. 

5.18 Using the results of Problems 5.16 and 5.17, verify the relations between the moments (a) m 2 = m' 2 - m' 2 , 
(b) «? 3 = »13 — 3m\m 2 + 2m[\ and (c) m 4 = m'^ — 4m[m'y + 6m' 2 m' 2 — 3m' 4 . 

5.19 Find the first four moments about the mean of the set of numbers in the arithmetic progression 2, 5, 8, 1 1, 
14, 17. 

5.20 Prove that (a) m 2 = m 2 + h 2 , (b) = m i + 3hm 2 + h 1 , and (c) m\ = m A + 4/™ 3 + 6h 2 m 2 + h 4 , where 



5.21 If the first moment about the number 2 is equal to 5, what is the mean? 

5.22 If the first four moments of a set of numbers about the number 3 are equal to —2, 10, —25, and 50, determine 
the corresponding moments (a) about the mean, (b) about the number 5, and (c) about zero. 

5.23 Find the first four moments about the mean of the numbers 0, 0, 0, 1, 1, 1, 1, and 1. 

5.24 (a) Prove that m 5 = m' 5 - 5m[m' A + \()m' 2 m' 3 - I0m[ 3 m 2 + 4m[ 5 . 

(b) Derive a similar formula for m b . 

5.25 Of a total of N numbers, the fraction p are l's, while the fraction q = 1 -p are 0's. Find (a) m u (b) m 2 , 

(c) m 3 , and (d) m 4 for the set of numbers. Compare with Problem 5.23. 

5.26 Prove that the first four moments about the mean of the arithmetic progression a, a + d, a + 2d,...,a + 
(n - \)d are m x = 0, m 2 = ±(n 2 - \)d 2 , w 3 = 0, and m A = ^ (« 2 - 1)(3« 2 - l)d A . Compare with Problem 
5.19 (see also Problem 4.69). [Hint: l 4 + 2 4 + 3 4 + ■ • • + (n - l) 4 = ±n(n - l)(2n - 1)(3» 2 - 3n - 1).] 
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MOMENTS FOR GROUPED DATA 

5.27 Calculate the first four moments about the mean for the distribution of Table 5.5. 

Table 5.5 



/ 



5.28 Illustrate the use of Charlier's check for the computations in Problem 5.27. 

5.29 Apply Sheppard's corrections to the moments obtained in Problem 5.27. 

5.30 Calculate the first four moments about the mean for the distribution of Problem 3.59(a) without Sheppard's 
corrections and (b) with Sheppard's corrections. 

5.31 Find (a) m u (b) m 2 , (c) m 3 , (d) m 4 , (e) X, (J) s, (g) X 1 , (h) X 1 , (/) X 1 , and (/) {X + if for the distribution of 
Problem 3.62. 



5.32 Find the moment coefficient of skewness, a 3 , for the distribution of Problem 5.27 (a) without and (b) with 
Sheppard's corrections. 

5.33 Find the moment coefficient of skewness, a 3 , for the distribution of Problem 3.59 (see Problem 5.30). 

5.34 The second moments about the mean of two distributions are 9 and 16, while the third moments about the 
mean are -8.1 and -12.8, respectively. Which distribution is more skewed to the left? 

5.35 Find Pearson's (a) first and (b) second coefficients of skewness for the distribution of Problem 3.59, and 
account for the difference. 

5.36 Find the (a) quartile and (b) percentile coefficients of skewness for the distribution of Problem 3.59. Compare 
your results with those of Problem 5.35 and explain. 

5.37 Table 5.6 gives three different distributions for the variable X. The frequencies for the three distributions are 
given by f\,f 2 , and/ 3 . Find Pearson's first and second coefficients of skewness for the three distributions. 
Use the corrected standard deviation when computing the coefficients. 



X 


/i 


h 


h 




10 


1 


1 


1 


5 




2 


2 


2 


14 


2 


3 


2 




5 


4 


1 




10 



138 



MOMENTS, SKEWNESS, AND KURTOSIS 



[CHAP. 5 



KURTOSIS 



5.38 Find the moment coefficient of kurtosis, a 4 , for the distribution of Problem 5.27 (a) wilhoul and (b) with 
Sheppard's corrections. 

5.39 Find the moment coefficient of kurtosis for the distribution of Problem 3.59 (a) without and (b) with 
Sheppard's corrections (see Problem 5.30). 

5.40 The fourth moments about the mean of the two distributions of Problem 5.34 are 230 and 780, respectively. 
Which distribution more nearly approximates the normal distribution from the viewpoint of (a) peakedness 
and (b) skewness? 

5.41 Which of the distributions in Problem 5.40 is (a) leptokurtic, (b) mesokurtic, and (c) platykurtic? 

5.42 The standard deviation of a symmetrical distribution is 5. What must be the value of the fourth moment 
about the mean in order that the distribution be (a) leptokurtic, (b) mesokurtic, and (c) platykurtic? 

5.43 (a) Calculate the percentile coefficient of kurtosis, k, for the distribution of Problem 3.59. 

(b) Compare your result with the theoretical value 0.263 for the normal distribution, and interpret. 

(c) How do you reconcile this result with that of Problem 5.39? 



SOFTWARE COMPUTATION OF SKEWNESS AND KURTOSIS 

5.44 The data in Fig. 5-5 show a sharp peak at 50. This should show up in the kurt 

EXCEL, show that the skewness is basically zero and that the kurtosis is 2.0134. 




Elementary 
Probability Theory 



DEFINITIONS OF PROBABILITY 
Classic Definition 

Suppose that an event E can happen in h ways out of a total of n possible equally likely ways. Then 
the probability of occurrence of the event (called its success) is denoted by 

p = Pr{E} = - n 

The probability of nonoccurrence of the event (called its failure) is denoted by 



Thus p + q = 



q = Pr{not E} = — ^— = 1 - - = 1 -p = 1 - Pr{E} 
, or Pr{i?} + Prjnot E} = 1. The event "not is sometimes denoted by E, E, or ~ 



EXAMPLE 1. When a die is tossed, there are 6 equally possible ways in which the die c 



and the probability of £ is Pr{ii} = 2/6 or 1/3. The probability of not getting a 3 or 4 (i.e., getting a 1, 
2, 5, or 6) is Pr{E} = 1 - Py{E} = 2/3. 

Note that the probability of an event is a number between and 1. If the event cannot occur, 
its probability is 0. If it must occur (i.e., its occurrence is certain), its probability is 1. 

If p is the probability that an event will occur, the odds in favor of its happening are p : q (read "p to 
</"); the odds against its happening are q : p. Thus the odds against a 3 or 4 in a single toss of a fair die 
are q : p = \ : \ = 2 : 1 (i.e., 2 to 1). 
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Relative-Frequency Definition 

The classic definition of probability has a disadvantage in that the words "equally likely" are vague. 
In fact, since these words seem to be synonymous with "equally probable," the definition is circular 
because we are essentially defining probability in terms of itself. For this reason, a statistical definition of 
probability has been advocated by some people. According to this the estimated probability, or empirical 
probability, of an event is taken to be the relative frequency of occurrence of the event when the number 
of observations is very large. The probability itself is the limit of the relative frequency as the number of 
observations increases indefinitely. 

EXAMPLE 2. If 1000 tosses of a coin result in 529 heads, the relative frequency of heads is 529/1000 = 0.529. 
If another 1000 tosses results in 493 heads, the relative frequency in the total of 2000 tosses is 
(529 + 493)/2000 = 0.51 1. According to the statistical definition, by continuing in this manner we should ultimately 
get closer and closer to a number that represents the probability of a head in a single toss of the coin. From the 
results so far presented, this should be 0.5 to one significant figure. To obtain more significant figures, further 
observations must be made. 

The statistical definition, although useful in practice, has difficulties from a mathematical point of 
view, since an actual limiting number may not really exist. For this reason, modern probability theory 
has been developed axioniatically- that is, the theory leaves the concept of probability undefined, much 
the same as point and line are undefined in geometry. 



CONDITIONAL PROBABILITY; INDEPENDENT AND DEPENDENT EVENTS 

If Ei and E 2 are two events, the probability that E 2 occurs given that E x has occurred is denoted by 
Pi{E 2 \E { }, or Pr{i?2 given E { }, and is called the conditional probability of E 2 given that E x has occurred. 

If the occurrence or nonoccurrence of E x does not affect the probability of occurrence of E 2 , then 
Pr{E 2 \Ei} = Pr{£2j an d we say that E x and E 2 are independent events; otherwise, they are dependent 
events. 

If we denote by E l E 2 the event that "both E y and E 2 occur," sometimes called a compound event, 

then 

Pr{E x E 2 } =Pr{E x }Px{E 2 \E{] (7) 

In particular, 

¥r{E x E 2 } = Pr{E x ] Pr{E 2 ] for independent events (2) 

For three events E x , E 2 , and E 3 , we have 

PviE^E,} = Pr^} PriE^} Pv{E 3 \E { E 2 } (J) 

That is, the probability of occurrence of E u E 2 , and E 3 is equal to (the probability of E{) x (the 
probability of E 2 given that E l has occurred) x (the probability of E 3 given that both E l and E 2 have 
occurred). In particular, 

PriE^Ei} = Pr{Ei} Pr{E 2 } Pr{E 3 } for independent events (4) 

In general, if E\, E 2 , E 3 ,...,E„ are n independent events having respective probabilities p\, p 2 , 
p 3 ,... ,p„, then the probability of occurrence of E l and E 2 and E 2 and • • • E n is Pip 2 p 3 ■ ■ -p„. 

EXAMPLE 3. Let E { and E 2 be the events "heads on fifth toss" and "heads on sixth toss" of a coin, respectively. 
Then £] and E 2 are independent events, and thus the probability of heads on both the fifth and sixth tosses is 
(assuming the coin to be fair) 



Pr{£,£ 2 } = Pr{£,} Pr{£ 2 } = Q) = \ 
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EXAMPLE 4. If the probability that A will be alive in 20 years is 0.7 and the probability that B will be alive in 
20 years is 0.5, then the probability that they will both be alive in 20 years is (0.7)(0.5) = 0.35. 

EXAMPLE 5. Suppose that a box contains 3 white balls and 2 black balls. Let E x be the event "first ball drawn is 
black" and E 2 the event "second ball drawn is black," where the balls are not replaced after being drawn. Here E x 
and E 2 are dependent events. 

The probability that the first ball drawn is black is Pr{£, } = 2/(3 + 2) = |. The probability that the second ball 
drawn is black, given that the first ball drawn was black, is Pr{£ 2 |£i} = 1/(3 + 1) = \. Thus the probability that 
both balls drawn are black is 

Pr{£,£ 2 } = Pr{£,} Pr{£ 2 |£, } = | ■ I = 1 



MUTUALLY EXCLUSIVE EVENTS 

Two or more events are called mutually exclusive if the occurrence of any one of them excludes the 
occurrence of the others. Thus if E x and E 2 are mutually exclusive events, then Prj^i^} = 0. 
If E 1 + E 2 denotes the event that "either E x or E 2 or both occur," then 



In particular, 



Pr{£, + E 2 } = PT{Ei} + Pr{£ 2 } - ?v{E { E 2 ) 



Pr{£' 1 + E 2 } = Pr-t^} + Pr{E 2 ] for mutually exclusive events 



(5) 
(6) 

As an extension of this, if E x , E 2 ,...,E n are n mutually exclusive events having respective prob- 
abilities of occurrence pi, p 2 , . . . ,p n , then the probability of occurrence of either E x or E 2 or •••£'„ is 



Result (5) can also be generalized to three or more mutually exclusive events. 



EXAMPLE 6. If E x is the event "drawing an ace from a deck of cards" and E 2 is t 
Pr{£]} = ^ = L am i Pr{£ 2 } = ^ = -L. The probability of drawing either an ace o 



: event "drawing a king," 
i king in a single draw is 







Pr{£i + E 2 


} = Pr{£ 
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EXAMPLE 7. If E x is the event ' 
E x and E 2 are not mutually exclus 
either an ace or a spade or both 



'drawing an ace" from a deck of cards and E 2 is the event "drawing a spade," then 
ive since the ace of spades can be drawn (Fig. 6-2). Thus the probability of drawing 
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Fig. 6-2 Ei is the event "drawing an ace" and E 2 is the event "drawing a spade." 
Note that the event "E x and E 2 " consisting of those outcomes in both events is the ace of spades. 
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PROBABILITY DISTRIBUTIONS 
Discrete 

If a variable X can assume a discrete set of values X x , X 2 ,...,X K with respective probabilities p u 
p 2 ,..., Pk, where p x +p 2 + ■ ■ ■ + Pk — 1> we sa Y that a discrete probability distribution for X has been 
defined. The function p{X), which has the respective values p x , p 2 ,. . . ,p K for X = X u X 2 ,...,X K , is 
called the probability function, or frequency function, of X. Because X can assume certain values with 
given probabilities, it is often called a discrete random variable. A random variable is also known as a 
chance variable or stochastic variable. 

EXAMPLE 8. Let a pair of fair dice be tossed and let X denote the sum of the points obtained. Then the probability 
distribution is as shown in Table 6.1. For example, the probability of getting sum 5 is ^ = i; thus in 900 tosses of the 
dice we would expect 100 tosses to give the sum 5. 

Note that this is analogous to a relative-frequency distribution with probabilities replacing the 
relative frequencies. Thus we can think of probability distributions as theoretical or ideal limiting 



Table 6.1 



X 


2 


3 


4 5 


6 


7 


8 


9 


10 




12 


P(X) 


1/36 


2/36 


3/36 4/36 


5/36 


6/36 


5/36 


4/36 


3/36 


2/36 


1/36 



forms of relative-frequency distributions when the number of observations made is very large. For 
this reason, we can think of probability distributions as being distributions of populations, whereas 
relative-frequency distributions are distributions of samples drawn from this population. 

The probability distribution can be represented graphically by plotting p(X) against X, just as for 
relative-frequency distributions (see Problem 6.11). 

By cumulating probabilities, we obtain cumulative probability distributions, which are analogous to 
cumulative relative-frequency distributions. The function associated with this distribution is sometimes 
called a distribution function. 

The distribution in Table 6.1 can be built using EXCEL. The following portion of an EXCEL 
worksheet is built by entering Diel into Al, Die2 into Bl, and Sum into CI. The 36 outcomes on the 
dice are entered into A2:B37. In C2, =SUM(A2:B2) is entered and a click-and-drag is performed from 
C2 to C37. By noting that the sum 2 occurs once, the sum 3 occurs twice, etc., the probability distribution 
in Table 6.1 is formed. 
Die1 Die2 Sum 

1 1 2 

1 2 3 

1 3 4 

1 4 5 

1 5 6 

1 6 7 



2 1 3 

2 2 4 

2 3 5 

2 4 6 

2 5 7 
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Continuous 

The above ideas can be extended to the case where the variable X may assume a continuous set of 
values. The relative-frequency polygon of a sample becomes, in the theoretical or limiting case of a 
population, a continuous curve (such as shown in Fig. 6-3) whose equation is Y = p(X). The total area 
under this curve bounded by the X axis is equal to 1 , and the area under the curve between lines X = a 
and X = b (shaded in Fig. 6-3) gives the probability that X lies between a and b, which can be denoted 
by Pv{a < X < b}. 



P(X) 




b X 



Fig. 6-3 Pt{a<X<b} is shown as the cross-hatched area under the density function. 
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We call p(X) a probability density junction, or briefly a density function, and when such a function is 
given we say that a continuous probability distribution for X has been defined. The variable X is then often 
called a continuous random variable. 

As in the discrete case, we can define cumulative probability distributions and the associated 
distribution functions. 



MATHEMATICAL EXPECTATION 

If p is the probability that a person will receive a sum of money S, the mathematical expectation 
(or simply the expectation) is defined as pS. 

EXAMPLE 9. Find E(X) for the distribution of the sum of the dice given in Table 6.1. The distribution is given in 
the following EXCEL printout. The distribution is given in A2:B12 where the p(X) values have been converted to 
their decimal equivalents. In C2, the expression =A2*B2 is entered and a click-and-drag is performed from C2 
to C12. In C13, the expression =Sum(C2:C12) gives the mathematical expectation which equals 7. 
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The concept of expectation is easily extended. If X denotes a discrete random variable that 
can assume the values X X ,X 2 ,...,X K with respective probabilities p { , p 2 , . . . ,p K , where 

Pi+p 2 -\ VPk — 1, the mathematical expectation of X (or simply the expectation of X), denoted 

by E(X), is defined as 

E(X) = p x X x +p 2 X 2 + --- +p K X K = PJ X J = Y,P X ( 7 ) 

If the probabilities pj in this expectation are replaced with the relative frequencies f]/N, where 
N = J2Jp the expectation reduces to (^fX)/N, which is the arithmetic mean I of a sample of size 
N in which X { , X 2 ,...,X K appear with these relative frequencies. As N gets larger and larger, the 
relative frequencies f/N approach the probabilities pj. Thus we are led to the interpretation that 
E(X) represents the mean of the population from which the sample is drawn. If we call m the sample 
mean, we can denote the population mean by the corresponding Greek letter p (mu). 

Expectation can also be defined for continuous random variables, but the definition requires the use 
of calculus. 

RELATION BETWEEN POPULATION, SAMPLE MEAN, AND VARIANCE 

If we select a sample of size N at random from a population (i.e., we assume that all such samples are 
equally probable), then it is possible to show that the expected value of the sample mean m is the 
population mean p. 
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It does not follow, however, that the expected value of any quantity computed from a sample is the 
corresponding population quantity. For example, the expected value of the sample variance as we have 
defined it is not the population variance, but (N - l)/N times this variance. This is why some statisti- 
cians choose to define the sample variance as our variance multiplied by N/(N— 1). 

COMBINATORIAL ANALYSIS 

In obtaining probabilities of complex events, an enumeration of cases is often difficult, tedious, or 
both. To facilitate the labor involved, use is made of basic principles studied in a subject called combi- 
natorial analysis. 

Fundamental Principle 

If an event can happen in any one of n x ways, and if when this has occurred another event can 
happen in any one of n 2 ways, then the number of ways in which both events can happen in the specified 
order is n x n 2 . 

EXAMPLE 10. The numbers through 5 are entered into Al through A6 in EXCEL and =FACT(A1) is entered 
into Bl and a click-and-drag is performed from Bl through B6. Then the chart wizard is used to plot the points. The 
function =FACT(n) is the same as «!. For n = 0, 1, 2, 3, 4, and 5, =FACT(n) equals 1, 1, 2, 6, 24, and 120. Figure 6.4 
was generated by the chart wizard in EXCEL. 

140 -i 1 

120 - ♦ 

100- 

80- 

60 - 

40 - 

20 - * 

I + ? . . . 

1 2 3 4 5 6 

Fig. 6-4 Chart for «! generated by EXCEL. 

EXAMPLE 11. The number of permutations of the letters a, b, and c taken two at a time is 3 P 2 = 3-2 = 6. 
These are ab, ha, ac, ca, be, and cb. 

The number of permutations of n objects consisting of groups of which n y are alike, n 2 are 
alike, - is 

—r — — where n = n 1 +n 2 + --- (10) 



EXAMPLE 12. The number of permutations of letters in the word statistics is 
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COMBINATIONS 

A combination of n different objects taken r at a time is a selection of r out of the n objects, with no 
attention given to the order of arrangement. The number of combinations of n objects taken r at a time is 
denoted by the symbol (") and is given by 

D- "-""h"- , + " -h^)I 



EXAMPLE 13. The number of combinations of the letters a, b, and c taken two at a time is 

These are ab, ac, and be. Note that ab is the same combination as ba, but not the same permutation. 

The number of combinations of 3 things taken 2 at a time is given by the EXCEL 
command = COMBIN(3, 2) which gives the number 3. 



STIRLING'S APPROXIMATION TO «! 

When n is large, a direct evaluation of n\ is impractical. In such cases, use is made of an approximate 
formula developed by James Stirling: 

«!«VW e - (72) 
where e — 2.71828 • • • is the natural base of logarithms (see Problem 6.31). 



RELATION OF PROBABILITY TO POINT SET THEORY 

As shown in Fig. 6-5, a Venn diagram represents all possible outcomes of an experiment as a 
rectangle, called the sample space, S. Events are represented as four-sided figures or circles inside the 
sample space. If S contains only a finite number of points, then with each point we can associate a 
nonnegative number, called probability, such that the sum of all numbers corresponding to all points in S 
add to 1. An event is a set (or collection) of points in S, such as indicated by E\ and E 2 in Fig. 6-5. 



EULER OR VENN DIAGRAMS AND PROBABILITY 

The event E x + E 2 is the set of points that are either in E x or E 2 or both, while the event E X E 2 is 
the set of points common to both E x and E 2 . Thus the probability of an event such as E x is the sum of 
the probabilities associated with all points contained in the set E x . Similarly, the probability of E x + E 2 , 
denoted by Pr{E x + E 2 }, is the sum of the probabilities associated with all points contained in 
the set Ei + E 2 . If E x and E 2 have no points in common (i.e., the events are mutually exclusive), 
then Vr{Ei + E 2 ] = Pr{£j + Pr{£ 2 }- If they have points in common, then Pr{Ei + E 2 } = 
Pr{E x } + Pr{E 2 } - Pr{EiE 2 }. 

The set E x + E 2 is sometimes denoted by E x U E 2 and is called the union of the two sets. The set E X E 2 
is sometimes denoted by E x n E 2 and is called the intersection of the two sets. Extensions to more than 
two sets can be made; thus instead of E x + E 2 + E 3 and E X E 2 E 3 , we could use the notations E x U E 2 U E 3 
and Ei C~iE 2 nE 3 , respectively. 

The symbol <j> (the Greek letter phi) is sometimes used to denote a set with no points in it, called 
the null set. The probability associated with an event corresponding to this set is zero (i.e., Pr{0} = 0). 
If Ei and E 2 have no points in common, we can write E x E 2 = (p, which means that the corresponding 
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Fig. 6-5 Operations on events, (a) Complement of event E shown in gray, written as E; (b) Intersectio 
and E 2 shown in gray, written as E\ n E 2 \ (c) Union of events E\ and E 2 shown in gra\ . writd 
(d) Mutually exclusive events E x and E 2 , that is £, n E 2 = <|>. 



events are mutually exclusive, whereby Pi{E l E 2 } = 0. 

With this modern approach, a random variable is a function defined at each point of the 
sample space. For example, in Problem 6.37 the random variable is the sum of the coordinates of 
each point. 

In the case where 5 has an infinite number of points, the above ideas can be extended by using 
concepts of calculus. 



EXAMPLE 14. An experiment consists of rolling a pair of dice. Event E x is the event that a 7 occurs, i. e., the sum 
on the dice is 7. Event E 2 is the event that an odd number occurs on die 1. Sample space S, events E\ and E 2 are 
shown. Find Pr{£,}, Pr{£ 2 }, Pr{£in£ 2 } and Pr(£iU£ 2 )- E u E 2 , and S are shown in separate panels of the 
MINITAB output shown in Fig. 6-6. 

P(E 1 ) = 6/36 =1/6 P{E 2 ) = 18/36 = 1/2 P(E 1 r\E 2 ) = 3/36 = 1/12 
P(Ei U E 2 ) = 6/36 + 18/36 - 3/36 = 21 /36. 



148 



ELEMENTARY PROBABILITY THEORY 
sample space S 



[CHAP. 6 



Fig. 6-6 MINITAB output for Example 14. 



Solved Problems 



FUNDAMENTAL RULES OF PROBABILITY 

6.1 Determine the probability p, or an estimate of it, for each of the following events: 

(a) An odd number appears in a single toss of a fair die. 

(b) At least one head appears in two tosses of a fair coin. 

(c) An ace, 10 of diamonds, or 2 of spades appears in drawing a single card from a well-shuffled 
ordinary deck of 52 cards. 

(d) The sum 7 appears in a single toss of a pair of fair dice. 

(<?) A tail appears in the next toss of a coin if out of 100 tosses 56 were heads. 

SOLUTION 

(a) Out of six possible equally likely cases, three cases (where the die comes up 1,3, or 5) are favorable to 
the event. Thus P = \ = \- 

(b) If H denotes "head" and T denotes "tail," the two tosses can lead to four cases: HH, HT, TH, and TT, 
all equally likely. Only the first three cases are favorable to the event. Thus p = |. 

(c) The event can occur in six ways (ace of spades, ace of hearts, ace of clubs, ace of diamonds, 10 of 
diamonds, and 2 of spades) out of 52 equally likely cases. Thus p = ^ = ^. 

(d) Each of the six faces of one die can be associated with each of the six faces of the other die, so that the 
total number of cases thai can arise, all equalh likely, is 6 • 6 = 36. These can be denoted by (1, 1), (2, 1), 
(3,1),..., (6, 6). 

There are six ways of obtaining the sum 7, denoted by (1,6), (2,5), (3,4), (4,3), (5,2), and (6, 1). 
Thus p = = 

(<?) Since 100 - 56 = 44 tails were obtained in 100 tosses, the estimated (or empirical) probability of a tail 
is the relative frequency 44/100 = 0.44. 
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An experiment consists of tossing a coin and a die. If E x is the event that "head" comes up in 
tossing the coin and E 2 is the event that "3 or 6" comes up in tossing the die, state in words the 
meaning of each of the following: 

(a) E x (c) E X E 2 (e) Pr{E x \E 2 } 

(b) E 2 (d) Pv{E x E 2 } (J) Pr{E x +E 2 } 

SOLUTION 

(a) Tails on the coin and anything on the die 

(b) 1, 2, 4, or 5 on the die and anything on the coin 

(c) Heads on the coin and 3 or 6 on the die 

id) Probability of heads on the coin and 1, 2, 4, or 5 on the die 

(e) Probability of heads on the coin, given that a 3 or 6 has come up on the die 

(/) Probability of tails on the coin or 1, 2, 4, or 5 on the die, or both 

A ball is drawn at random from a box containing 6 red balls, 4 white balls, and 5 blue balls. 
Determine the probability that the ball drawn is (a) red, (b) white, (c) blue, (d) not red, and 
(e) red or white. 

SOLUTION 

Let R, W, and B denote the events of drawing a red ball, white ball, and blue ball, respectively. Then: 
ways of choosing a red ball 6 6 2 



it) Pr w = 6T4Tr 

«> Pr ^ = 6+4+r 

id) ^{R} = \-Pr{R} = \--= J - 



(e) Px{R + W) = 



ways of choosing a red or white ball _ 6 + 4 



total ways of choosing a ball 6 + 4 + 5 15 3 

Another method 



Pr{i? + W } = Pr{B} = 1 - Pr{B} = 1 - - = - by part (c) 

Note that Pr{,R+ W} = Pr{R} + Pr{ W} (i.e., f = § + ^). This is an illustration of the general rule 
Pr{£i +E 2 } = Pr{E{\ + Pr{£ 2 } that is true for mutually exclusive events E x and E 2 . 



6.4 A fair die is tossed twice. Find the probability of getting a 4, 5, or 6 on the first toss and a 1, 2, 3, 
or 4 on the second toss. 

SOLUTION 

Let E x = event "4, 5, or 6" on the first toss, and let E 2 = event "1,2, 3, or 4" on the second toss. Each of 
the six ways in which I he die can fall on the first toss can be associated with each of the six ways in which it 
can fall on the second toss, a total of 6 ■ 6 = 36 ways, all equally likely. Each of the three ways in which E\ 
can occur can be associated with each of the four ways in which E 2 can occur, to give 3 • 4 = 12 ways in 
which both £, and E 2 , or E X E 2 occur. Thus Pr{E x E 2 } = 12/36 = 1/3. 

Note that Prf^,^} = Pr{£i} Pr{£ 2 } (i.e., \ = I ■ 5) is valid for the independent events E x and E 2 . 
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6.5 Two cards are drawn from a well-shuffled ordinary deck of 52 cards. Find the probability that 
they are both aces if the first card is (a) replaced and (b) not replaced. 

SOLUTION 

Let E x = event "ace" on the first draw, and let E 2 = event "ace" on the second draw. 

(a) If the first card is replaced, E x and E 2 are independent events. Thus Pr{both cards drawn are 
aces} = Vr{E x E 2 } = Pr{£,} Pr{£ 2 } = (A)(4) = 

(b) The first card can be drawn in any one of 52 ways, and the second card can be drawn in any one 
of 51 ways since the first card is not replaced. Thus both cards can be drawn in 52-51 ways, 
all equally likely. 

There are four ways in which E x can occur and three ways in which E 2 can occur, so that both E x 
and E 2 , or E t E 2 , can occur in 4 ■ 3 ways. Thus Prf^^} = (4 ■ 3)/(52 • 51) = ± 

Note that ~Pv{E 2 \E x } = Pr{second card is an ace given that first card is an ace} = jj. Thus our 
result is an illustration of the general rule that Pr{E x E 2 } = Pr{£,} Pr{£ 2 |£i} when E x and E 2 are 
dependent events. 



Three balls are drawn successively from the box of Problem 6.3. Find the probability 
that they are drawn in the order red, white, and blue if each ball is (a) replaced and (b) 
not replaced. 



Let R = event "red" on the first draw, W = event "white" on the second draw, and B = event "blue" 
on the third draw. We require Vr{RWB}. 

(a) If each ball is replaced, then R, W, and B are independent events and 

(b) If each ball is not replaced, then R, W, and B are dependent events and 

Pr { ***> = Pr W Pr W} Vr { B\WR } = (^) (^) (^) 

= (is) (14) (n) = 9T 

where Pr{B\ WR} is the conditional probability of getting a blue ball if a white and red ball have already 
been chosen. 



Find the probability of a 4 turning up at least once in two tosses of a fair die. 
SOLUTION 



Let E x — event "4" c 
on the first toss or "< 
We require Pr{£, + E 2 }. 

First method 

The total number c 



the first toss, E 2 = event "4" on the second toss, and E x + E 2 = event "4" 
on the second toss or both = event that at least one 4 turns up. 



' equally likely ways 

Number of ways 
Number of ways 
Number of ways 



which both dice can fall is 6 • 6 = 36. Also, 
which E x occurs but not E 2 = 5 
which E 2 occurs but not E x = 5 
which both E x and E 2 occur = 1 



Thus the number of ways ii 
Pr{£ 1+ £ 2 } = li. 



which at least one of the events E x or E 2 occurs is 5 + 5 + 1 = 1 1, and thus 
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Second method 

Since E x and E 2 are not mutually exclusive, Pr{£, + E 2 } = Pr{E x } + Pr{E 2 } - Vr{E l E 2 }. Also, since 
Ei and E 2 are independent, Pr{E x E 2 } = Pr{E x } Pr{£ 2 }. Thus Prj^ + £ 2 } = Pr{£,} + Pr{£ 2 }- 
Pr{£ 1 }Pr{£ 2 }=i + i-(I)(i)=ii. 

Third method 

Pr{at least one 4 comes up} + Pr{no 4 comes up} = 1 
Thus Pr{at least one 4 comes up} = 1 - Pr{no 4 comes up} 

= 1 - Pr{no 4 on first toss and no 4 on second toss} 
= 1 - Pr{E x E 2 \ = 1 - Pr{£,} Pr{£ 2 } 

= i - = M 



One bag contains 4 white balls and 2 black balls; another contains 3 white balls and 5 black balls. 
If one ball is drawn from each bag, find the probability that (a) both are white, (b) both are black, 
and (c) one is white and one is black. 

SOLUTION 

Let W x = event "white" ball from the first bag, and let W 2 = event "white" ball from the second bag. 
(«) Pr{ Wi W 2 } = Pr{ W x } Pr{ W 2 } = (^_) = l _ 

(b) Pr{ WiW 2 } = Pr{ Wi } Pr{ W l} = (^) (^_) = A 

(c) The event "one is white and one is black" is the same as the event "either the first is white and the 
second is black or the first is black and the second is white"; that is, W x W 2 + W X W 2 . Since events 
W x W 2 and W x W 2 are mutually exclusive, we have 

Pr{ Wi W 2 +WiW 2 } = Pr{ W x W 2 } + Pr{ W x W 2 } 

= Pr{ W x } Pr{ W 2 } + Pr{ W x } Pr{ W 2 \ 



(4 + 2) (3 + 5) + (4 + 2) (3 + 5) 24 



Another method 

The required probability is 1 - Pr{ W x W 2 ) - Pr{ W x W 2 ) = 1 - \ - 



6.9 A and B play 12 games of chess, of which 6 are won by A, 4 are won by B, and 2 end in a 
draw. They agree to play a match consisting of 3 games. Find the probability that (a) 
A wins all 3 games, (b) 2 games end in a draw, (c) A and B win alternately, and (d) B wins 
at least 1 game. 

SOLUTION 

Let A x , A 2 , and A 3 denote the events "A wins" in the first, second, and third games, respectively; let B x , 
B 2 , and B 3 denote the events "B wins" in the first, second, and third games, respectively; and let, D x , D 2 , and 
Z> 3 denote the events "there is a draw" in the first, second, and third games, respectively. 

On the basis of their past experience (empirical probability), we shall assume that Pr{^ wins any one 
game} = f 2 = that Pr{_S wins any one game} = ^ = 5, and that Pr{any one game ends in a draw} ~j 2 = \- 



(a) PrL4 wins all 3 games} = Pr{^!^ 2 ^ 3 } = Prl^,} Pr{^ 2 } Pr{^ 3 } = 
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assuming that the results of each game are independent of the results of any others, which appears to be 
justifiable (unless, of course, the players happen to be psychologically influenced by the other one's 
winning or losing). 

(b) Pr{2 games end in a draw} = Pr{lst and 2nd or 1st and 3rd or 2nd and 3rd games end in a draw} 

= Pr{D 1 D 2 5 3 } + Pr{D i D 2 D 3 } + Pt{D ] D 2 D 3 } 
= Pr{D x } Pr{D 2 \ Pr{D 3 } + Pr{Z>, } Pr{D 2 } Pr{£> 3 } 
+ Pr{D t }Pv{D 2 } Pv{D 3 } 

-©©©♦©©©♦©©©-£-£ 

(c) Pr{^ and B win alternately} = Pr{^ wins then B wins then A wins or B wins then A wins then B wins} 

= Pr{A t B 2 A 3 + BiA 2 B 3 } = Pr{A x B 2 A 3 } + Pr}^^} 
= Pr{^} Pi{B 2 } PrL4 3 } + Pr}^} Pr{^ 2 } Pr{5 3 } 

-®6)© + G)©6)-i 

(d) Pr{B wins at least 1 game} = 1 - Pr{B wins no game} 

= 1 - PviBiB^} = 1 - Pr{5,} Pr{B 2 } Pr{B 3 \ 

PROBABILITY DISTRIBUTIONS 

6.10 Find the probability of boys and girls in families with three children, assuming equal probabilities 
for boys and girls. 

SOLUTION 

Let B = event "boy in the family," and let G = event "girl in the family." Thus according to the 
assumption of equal probabilities, Pr{i?} = Pr{G} = \. In families of three children the following mutually 
exclusive events can occur with the corresponding indicated probabilities: 

(a) Three boys (BBB): 

Pr{BBB} = Pr{B} Pr{B} Pr{B} = * 

Here we assume that the birth of a boy is not influenced in any manner by the fact that a previous child 
was also a boy, that is, we assume that the events are independent. 

(b) Three girls (GGG): As in part (a) or by symmetry, 

Pr{GGG} = l - 

(c) Two boys and one girl (BBG + BGB + GBB): 

Pr{BBG + BGB + GBB} = Pr{BBG} + Pr{BGB\ + Pr{GBB\ 

= Pr{B} Pr{B} Pr{G} + Pr{B} Pr{G} Pr{B\ + Pr{G} Pr{B} Pr{B\ 
_ 1 1 1 _ 3 
~8 + 8 + 8~ 8 

(d) Two girls and one boy (GGB + GBG + BGG): As in part (c) or by symmetry, the probability is f . 

If we call X the random variable showing the number of boys in families with three children, the 
probability distribution is as shown in Table 6.2. 
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Table 6.2 



Number of boys X 


12 3 


Probability p(X) 


1/8 3/8 3/8 1/8 



6.11 Graph the distribution of Problem 6.10. 
SOLUTION 

The graph can be represented either as in Fig. 6-7 or Fig. 6-8. Note that the sum of the areas of the 
rectangles in Fig. 6-8 is 1; in this figure, called a pmhahility liisioitraiu. we are considering Jfasa continuous 
variable even though it is actually discrete, a procedure that is often found useful. Figure 6-7, on the other 
hand, is used when one does not wish to consider the variable a 



Fig. 6-7 SPSS plot of probability distribution. 



Boys 

Fig. 6-8 MINITAB probability histogra 
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6.12 A continuous random variable X, having values only between and 5, has a density function 
given by p(X) = j ^j^^ 5 ■ The graph is shown in Fig. 6-9. 



(a) Verify that it is a density function. 



Fig. 6-9 Probability density function for the variable X. 



(b) Find and graph Pr{ 2.5<Z<4.0}. 



SOLUTION 

(a) The function p{X) is always > and the total area under the graph of p(X) is 5 x 0.2 = 1 since it is 
rectangular in shape and has width 0.2 and length 5 (see Fig. 6-9). 

(b) The probability Pr{ 2.5<X<4.0} is shown in Fig. 6-10. 



0.250 
0.225 
& 0.200 
0.175 
0.150 

1 2 3 4 5 

X 

Fig. 6-10 The probability Pr(2.5 <X<4.0) is shown as the cross-hatched area. 




The rectangular area, Pr{2.5<Z<4.0}, is (4 - 2.5) x 0.2 = 0.3. 
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MATHEMATICAL EXPECTATION 

6.13 If a man purchases a raffle ticket, he can win a first prize of $5000 or a second prize of $2000 with 
probabilities 0.001 and 0.003. What should be a fair price to pay for the ticket? 

SOLUTION 

His expectation is ($5000)(0.001) + ($2000) (0.003) = $5 + $6 = $11, which is a fair price to pay. 

6.14 In a given business venture a lady can make a profit of $300 with probability 0.6 or take a loss of 
$100 with probability 0.4. Determine her expectation. 

SOLUTION 

Her expectation is ($300) (0.6) + (-$100)(0.4) = $180 - $40 = $140. 

6.15 Find (a) E(X), (b) E(X 2 ), and (c) E[(X - X) 2 )] for the probability distribution shown in 
Table 6.3. 

(d) Give the EXCEL solution to parts (a), (b), and (c). 



Table 6.3 



X 


8 12 16 20 24 


p{x) 


1/8 1/6 3/8 1/4 1/12 



SOLUTION 

(a) E(X) = J2 Xp(X) = (8)(I) + (12)0 + (16)(|) + (20)© + (24)(£) = 16; this represents the mean of 
the distribution. 

(b) E{X 2 ) = Y, Y 2 pW = (8) 2 (|) + (12) 2 (i) + (16) 2 (|) + (20) 2 (i) + (24) 2 (|L)= 276; this represents the 
second moment about the origin zero. 

(c) E[(X - X) 2 ] = £(X - X) 2 p{X) = (8 - 16) 2 (i) + (12 - 16) 2 (I) + (16 - 16) 2 (|) + (20 - 16) 2 (I)+ 
(24 — 16) (j2) = 20; this represents the variance of the distribution. 

(d) The labels are entered into A1:E1 as shown. The lvalues and the probability values are 
entered into A2:B6. The expected values of X are computed in C2:C7. The expected value is 
given in C7. The second moment about the origin is computed in D2:D7. The second 
moment is given in D7. The variance is computed in E2:E7. The variance is given in E7. 



A 


B 


C 


D 


E 


X 

8 


P(X) 
0.125 


Xp(X) 
1 


XA2p(X) 
8 


(X - £(X))A 2 *p(X) 
8 


12 


0.166667 


2 


24 


2.666666667 


16 


0.375 


6 


96 





20 


0.25 


5 


100 


4 


24 


0.083333 


2 


48 


5.333333333 






16 


276 


20 



6.16 A bag contains 2 white balls and 3 black balls. Each of four persons, A, B, C, and D, in the order 
named, draws one ball and does not replace it. The first to draw a white ball receives $10. 
Determine the expectations of A, B, C, and D. 
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Since only 3 black balls are present, one person must win on his or her first attempt. Denote by A, B, C, 
and D the events "A wins," wins," "C wins," and "D wins," respectively. 

PrL4 wins} = Pr{^}=^ I = ^ 
Thus ,4's expectation = § ($10) = $4. 

Pr{^ loses and B wins} = Pr{AB} = Pr{^} Pr{B\A} = 
Thus 5's expectation = $3. 

Pr{A and B lose and C wins} = ¥t{ABC} = Pr{A} Pr{B\A} ¥t{CAB} = (^j (^j = ^ 
Thus C's expectation = $2. 

Pr{^, B, and C lose and D wins} = Pr{ABCD} 

= Pr{A\ Pr{B\A} Pr{C\AB} Pr{D\ABC} 

-©©©OK 

Thus Z)'s expectation = $1. 

Cfec/t: $4 + $3 + $2 + $l = $10, and f + ^ + i + 4 = 1. 



PERMUTATIONS 

6.17 In how many ways can 5 differently colored marbles be arranged in a row? 
SOLUTION 

We must arrange the 5 marbles in 5 positions: . The first position can be occupied by any 

one of 5 marbles (i.e., there are 5 ways of filling the first position). When this has been done, there are 4 ways 
of filling the second position. Then there are 3 ways of filling the third position, 2 ways of filling the fourth 
position, and finally only 1 way of filling the last position. Therefore: 

Number of arrangements of 5 marbles in a row = 5 • 4 • 3 • 2 • 1 = 5! = 120 

In general, 

Number of arrangements of n different objects in a row = n(n — l)(w — 2) • • • 1 = n\ 
This is also called the number of permutations of different objects taken n at a time and is denoted by „P„. 



6.18 In how many ways can 10 people be seated on a bench if only 4 seats are available? 
SOLUTION 

The first seat can be filled in any one of 10 ways, and when this has been done there are 9 ways of filling 
the second seat, 8 ways of filling the third seat, and 7 ways of filling the fourth seat. Therefore: 
Number of arrangements of 10 people taken 4 at a time = 10 • 9 ■ 8 • 7 = 5040 

In general, 



Number of arrangements of n different objects taken r at a time = n{n- 1) •••(«- r + 1) 
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This is also called the number of permutations of n different objects taken r at a time and is denoted by n P r , 
P(n, r) or P„,. Note that when r = n, n P„ = n\, as in Problem 6.17. 

6.19 Evaluate (a) & P 3 , (b) 6 P 4 , (c) 15 Pi, and 3 P 3 , and (<?) parts (a) through (d) using EXCEL. 
SOLUTION 

(a) g P 3 = 8 • 7 • 6 = 336, (b) 6 P 4 = 6 • 5 • 4 • 3 = 360, (c) 15 P, = 15, and (d) 3 P 3 = 3 • 2 • 1 = 6, 
(e) =PERMUT(8, 3) = 336 = PERMUT(6, 4) = 360 
=PERMUT(15,1) = 15 =PERMUT(15,1) = 6 

6.20 It is required to seat 5 men and 4 women in a row so that the women occupy the even places. 
How many such arrangements are possible? 

SOLUTION 

The men may be seated in 5 P 5 ways and the women in 4 P 4 ways; each arrangement of the men may be 
associated with each arrangement of the women. Hence the required number of arrangements is 
5 P 5 ■ 4P4 = 5!4! = (120) (24) = 2880. 

6.21 How many four-digit numbers can be formed with the 10 digits 0, 1, 2, 3, . . . , 9, if (a) repetitions 
are allowed, (b) repetitions are not allowed, and (c) the last digit must be zero and repetitions are 
not allowed? 

SOLUTION 

(a) The first digit can be any one of 9 (since is not allowed). The second, third and fourth digits can be 
any one of 10. Then 9 • 10 • 10 • 10 = 9000 numbers can be formed. 

(b) The first digit can be any one of 9 (any one but 0). 

The second digit can be any one of 9 (any but that used for the first digit). 
The third digit can be any one of 8 (any but those used for the first two digits). 
The fourth digit can be any one of 7 (any but those used for the first three digits). 
Thus 9 • 9 • 8 • 7 = 4536 numbers can be formed. 

Another method 

The first digit can be any one of 9 and the remaining three can be chosen in 9 P 3 ways. Thus 
9 • 9P3 = 9 • 9 • 8 • 7 = 4536 numbers can be formed. 

(c) The first digit can be chosen in 9 ways, the second in 8 ways, and the third in 7 ways. Thus 9 • 8 ■ 7 = 504 
numbers can be formed. 

Another method 

The first digit can be chosen in 9 ways and the next two digits in 9 P 2 ways. Thus 
9 • 8 P 2 = 9 • 8 • 7 = 504 numbers can be found. 

6.22 Four different mathematics books, 6 different physics books, and 2 different chemistry books are to 
be arranged on a shelf. How many different arrangements are possible if (a) the books in each 
particular subject must all stand together and (b) only the mathematics books must stand together? 

SOLUTION 

(a) The mathematics books can be arranged among themselves in 4 P 4 = 4! ways, the physics books in 
6 P 6 = 6! ways, the chemistry books in 2P2 = 2! ways, and the three groups in 3 P 3 = 3! ways. Thus the 
required number of arrangements = 4! 6! 2! 3! = 207,360. 
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(b) Consider the 4 mathematics books as one big book. Then we have 9 books that can be arranged 

in g P g = 9! ways. In all of these ways the mathematics books are together. But the mathematics 

books can be arranged among themselves in 4 P 4 = 4! ways. Thus the required number of 
arrangements = 9! 4! = 8,709,120. 

6.23 Five red marbles, 2 white marbles, and 3 blue marbles are arranged in a row. If all the marbles of 
the same color are not distinguishable from each other, how many different arrangements are 
possible? Use the EXCEL function =MULTINOMIAL to evaluate the expression. 
SOLUTION 

Assume that there are P different arrangements. Multiplying P by the numbers of ways of arranging 
(a) the 5 red marbles among themselves, (b) the 2 white marbles among themselves, and (c) the 3 blue 
marbles among themselves (i.e., multiplying P by 5! 2! 3!), we obtain the number of ways of arranging the 10 
marbles if they are distinguishable (i.e., 10!). Thus 

10' 

(5!2!3!)P=10! and P = — 

In general, the number of different arrangements of n objects of which n x are alike, n 2 are alike, . . . ,n k 
are alike is 



n x \n 2 \---n k \ 

where n\ + n 2 + ■ ■ ■ + n k = n. 

The EXCEL function =MULTINOMIAL(5,2,3) gives the value 2520. 

6.24 In how many ways can 7 people be seated at a round table if (a) they can sit anywhere and 
(b) 2 particular people must not sit next to each other? 

SOLUTION 

(a) Let 1 of them be seated anywhere. Then the remaining 6 people can be seated in 6! = 720 ways, which is 
the total number of ways of arranging the 7 people in a circle. 

(b) Consider the 2 particular people as 1 person. Then there are 6 people altogether and they can be 
arranged in 5! ways. But the 2 people considered as 1 can be arranged among themselves in 2! ways. 
Thus the number of ways of arranging 6 people at a round table with 2 particular people sitting 
together = 5!2! = 240. 

Then using part (a), the total number of ways in which 6 people can be seated at a round table so 
that the 2 particular people do not sit together = 720 - 240 = 480 ways. 



COMBINATIONS 



6.25 In how many ways can 10 objects be split into two groups containing 4 and 6 objects, respectively. 
SOLUTION 

This is the same as the number of arrangements of 10 objects of which 4 objects are alike and 6 other 
objects are alike. By Problem 6.23, this is 

i^ 10 - 9 - 8 - 7 = 210 
4!6! 4! 

The problem is equivalent to finding the number of selections of 4 out of 10 objects (or 6 out of 10 
objects), the order of selection being immaterial. 

In general the number of selections of r out of n objects, called the number of combinations of n things 
taken r at a time, is denoted by (") and is given by 



O-Hi 



n(n-\)...{n-r+\) 
(n-r)\ ~ r\ 
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6.26 Evaluate (a) Q, (b) (!?), (c) (*), and (d) parts (a) through (c) using EXCEL. 
SOLUTION 

p\ 7! 7- 6- 5-4 7- 6- 5 

» (9 -(9- 

(c) (*) is the number of selections of 4 objects taken all at a time, and there is only one such selection; 
thus (4) = 1, Note that formally 

( A )= — =1 
\4/ 4!0! 

if we <fe/z«e 0! = 1. 

(rf) = COMBIN(7, 4) gives 35, = COMBIN(6, 5) gives 6, and 
= COMBIN(4, 4) gives 1. 

6.27 In how many ways can a committee of 5 people be chosen out of 9 people? 
SOLUTION 

(9\ 9! 9-S-7-6-5 
W = 5!4! = 5! =126 

6.28 Out of 5 mathematicians and 7 physicists, a committee consisting of 2 mathematicians and 3 
physicists is to be formed. In how many ways can this be done if (a) any mathematician and any 
physicist can be included, (b) one particular physicist must be on the committee, and (c) two 
particular mathematicians cannot be on the committee? 

SOLUTION 

(a) Two mathematicians out of 5 can be selected in Q ways, and 3 physicists out of 7 can be selected in Q 
ways. The total number of possible selections is 

(b) Two mathematicians out of 5 can be selected in ( 5 2 ) ways, and 2 additional physicists out of 6 can be 
selected in ( 6 2 ) ways. The total number of possible selections is 

©■©-">■"-'» 

(c) Two mathematicians out of 3 can be selected in (2) ways, and 3 physicists out of 7 can be selected in (3) 
ways. The total number possible selections is 

G)-G)=— 

6.29 A girl has 5 flowers, each of a different variety. How many different bouquets can she form? 
SOLUTION 

Each flower can be dealt with in 2 ways: It can be chosen or not chosen. Since each of the 2 ways of 
dealing with a flower is associated with 2 ways of dealing with each of the other flowers, the number of ways 
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of dealing with the 5 flowers = 2 5 . But these 2 5 ways include the case in which no flower is chosen. Hence the 
required number of bouquets = 2 5 - 1 = 31 . 

Another method 

She can select either 1 out of 5 flowers, 2 out of 5 flowers, . . . , 5 out of 5 flowers. Thus the required 
number of bouquets is 



6.30 From 7 consonants and 5 vowels, how many words can be formed consisting of 4 different 
consonants and 3 different vowels? The words need not have meaning. 

SOLUTION 

The 4 different consonants can be selected in Q) ways, the 3 different vowels can be selected in (J) ways, 
and the resulting 7 different letters (4 consonants and 3 vowels) can then be arranged among themselves 
in 1 P 1 = 7! ways. Thus the number of words is 



STIRLING'S APPROXIMATION TO n! 

6.31 Evaluate 50!. 
SOLUTION 

For large n, we have n\ m \Jl-Knn" e~ n ; thus 

50!^ v / 2^(50) 50 50 e- 50 = 5 
To evaluate S, use logarithms to the base 10. Thus 

log S = log (VKKhr50 50 e- 50 ) = \ log 100 + \ log n + 50 log 50 - 50 log e 

= A log 100 + i log 3.142 + 50 log 50 - 50 log2.718 
= | (2) + i (0.4972) + 50(1 .6990) - 50(0.4343) = 64.4846 
from which 5* = 3.05 x 10 64 , a number that has 65 digits. 

PROBABILITY AND COMBINATORIAL ANALYSIS 

6.32 A box contains 8 red, 3 white, and 9 blue balls. If 3 balls are drawn at random, determine the 
probability that (a) all 3 are red, (b) all 3 are white, (c) 2 are red and 1 is white, (d) at least 1 is 
white, (e) 1 of each color is drawn, and (/) the balls are drawn in the order red, white, blue. 
SOLUTION 

(a) First method 



Let R\, R 2 , and i? 3 denote the events "red ball on first draw," "red ball on second draw," and "red 
ball on third draw," respectively. Then R l R 2 Ri, denotes the event that all 3 balls drawn are red. 




In general, for any positive integer n. 





7! = 35- 10-5040 = 1,764,000 



Pritf,^} = Pr{iM Pr{* 2 |iM Pr{* 3 |*i* 2 } = Q (^) Q 



14 

285 
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Second method 



number of selections of 3 out of 20 balls ^20 j 285 
(b) Using the second method of part (a), 



Pr{all 3 are white} = -^l 



20\ 1140 



The first method of part (a) can also be used. 

ut\ /selections of 1 out\ /8\ /3\ 
) \ of 3 white balls ) _ \2j \l) _ 7 



(c) Pr{2 are red and 1 is white} = 



/ selections of 2 out^ ( selections of 1 
^ of 8 red balls 
number of selections of 3 out of 20 b 



{d) Pr{none is white} = W = ^ so Pr{at least 1 is white} = 1 - = g 



Pr{l of each color is drawn} = ^ /Vn\ 



(/) Using part (e), 

Pr{balls drawn in order red, white, blue} = ^ Pr{l of each color is drawn} = ^ (j^J = ^ 
Another method 

Pr{*, W 2 B 2 } = Pr{ Rl }Pr{W 2 \R,} Pr{2» 3 |*, W 2 } = (A) (A) Q = A 

6.33 Five cards are drawn from a pack of 52 well-shuffled cards. Find the probability that (a) 4 are 
aces; (b) 4 are aces and 1 is a king; (c) 3 are 10's and 2 are jacks; (d) a 9, 10, jack, queen, and king 
are obtained in any order; (e) 3 are of any one suit and 2 are of another; and (J) at least 1 ace is 
obtained. 



SOLUTION 



,>-f 4 v (iL_l 



(l) 

Pr{4aces and 1 king} = Q . W = _j_ 
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Pr{3 are 10's and 2 are jacks} = 



3 J /52\ 108,290 



( 4 H 4 H 4 H 4 V( 4 ) 

(</) Pr{9, 10, jack, queen, king in any order} = / 5 ' 2 \ = 1 52435 

(e) Since there are 4 ways of choosing the first suit and 3 ways of choosing the second suit, 
Pr{3 of any one suit, 2 of another} = — — J^\~ — = 4T65 

or 1 \V 35 > 673 a nr*i ♦ , 1 , 35 ' 673 18 < 482 

(/) Pr{noace}=W = __ and Pr{at least 1 ace} = 1 - — = — 



6.34 Determine the probability of three 6's in five tosses of a fair die. 
SOLUTION 

Let the tosses of the die be represented by the 5 spaces . In each space we will have either the 

events 6 or non-6 (6); for example, three 6's and two non-6's can occur as 6 6 6 6 6 or as 6 6 6 6 6, etc. 
Now the probability of an event such as 6 6 6 6 6 is 

Pr{6 6 6 6 6} = Pr{6} Pr{6} Pr{6} Pr{6} Pr{6} = I • I -L l -- 5 - = Q ' * 

Similarly, Pr{66666} = (^) '(|) : . etc., for all events in which three 6's and two non-6's occur. But there are 
(3) = 10 such events, and these events are mutually exclusive; hence the required probability is 

Pr {66 666 or 66666 oretc.}^(5(I) 3 @ 2 ^ 

In general, if p = Pr{£} and q = Pr{£}, then by using the same reasoning as given above, the prob- 
ability of getting exactly X £"s in N trials is ( N x )p x q N ~ x . 



6.35 A factory finds that, on average, 20% of the bolts produced by a given machine will be defective 
for certain specified requirements. If 10 bolts are selected at random from the day's production of 
this machine, find the probability (a) that exactly 2 will be defective, (b) that 2 or more will be 
defective, and (c) that more than 5 will be defective. 

SOLUTION 

(a) Using reasoning similar to that of Problem 6.34, 

Pr{2 defective bolts} = (0.2) 2 (0.8) 8 = 45(0.04)(0.1678) = 0.3020 

(b) Pr{2 or more defective bolts} = 1 - Pr{0 defective bolts} - Pr{ 1 defective bolt} 

= 1-(^ 1 )(0.2) (0.8) IO -(^ 1 1 )(0.2) 1 (0.8) 9 

= 1-(0.8) 10 -10(0.2)(0.8) 9 
= 1 -0.1074- 0.2684 = 0.6242 
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Prjmore than 5 defective bolts} = Pr{6 defective bolts} + Pr{7 defective bolts} 
+ Pr{8 defective bolts} + Pr{9 defective bolts} 
+ Pr{10 defective bolts} 



= ( 1 6 °)(0.2) 6 (0.8) 4 + ( 1 7 °)(0.2) 7 (0.8) 3 + ^ 8 °)(0.2) 8 (0.8) 2 



j (0.2)^(0.8) 4 
= 0.00637 



6.36 If 1000 samples of 10 bolts each were taken in Problem 6.35, in how many of these samples 
would we expect to find (a) exactly 2 defective bolts, (b) 2 or more defective bolts, and (c) more 
than 5 defective bolts? 
SOLUTION 

(a) Expected number = (1000)(0.3020) = 302, by Problem 6.35(a). 

(b) Expected number = (1000)(0.6242) = 624, by Problem 6.35(A). 

(c) Expected number = (1000) (0.00637) = 6, by Problem 6.35(c). 



EULER OR VENN DIAGRAMS AND PROBABILITY 



6.37 Figure 6-1 1 shows how to represent the sample space for tossing a fair coin 4 times and event E x 
that exactly two heads and two tails occurred and event E 2 that the coin showed the same thing on 
the first and last toss. This is one way to represent Venn diagrams and events in a computer 
worksheet. 



sample space 



event E1 event E2 



Fig. 6-11 EXCEL display of sample space and e 



The outcomes in E x have an X beside them under event E x and the outcomes in E 2 have a Y beside them 
under event E 2 . 

(a) Give the outcomes in E x n E 2 and £, U E 2 . 

(b) Give the probabilities Pr^ n E 2 } and Pr{£, U E 2 }. 

SOLUTION 

(a) The outcomes in E x n E 2 have an X and a Y beside them. Therefore £, n E 2 consists of outcomes htth 
and thht. The outcomes in £, U E 2 have an X, Y, or X and Y beside them. The outcomes in £, U E 2 are: 
hhhh, hhth, hhtt, hthh, htth, thht, thth, thtt, tthh, ttht, and tttt. 

(b) Pr{£ 1 n£ 2 } = 2/16= 1/8 or 0.125. Pr{£i UE 2 } = 12/16 = 3/4 or 0.75. 
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6.38 Using a sample space and Venn diagrams, show that: 

(a) Pr{A U B} = Pr{A} + Pr{B} - Pr{A n B} 

(b) Pr{A U£UC} = Pr{A} + Pr{B} + Pr{C} - Pv{A n B} - Pi{B(l C}~ 
Pr{A n C} + Pr{A n 5 n C} 



SOLUTION 

(a) The nonmutually exclus 
and in5. 



n A U 5 is expressible as the mutually exclus 



\ 








\ 








Ar>B 



not An B, BnA, 



Fig. 6-12 A union expressed as a disjoint u 



Pr{^ U B} = Pr{A n B} + Pr{B n A} + Pr{A n 5} 
Now add and subtract Pr{y4 nB} from the right-hand side of this equation. 

Pr{yl U B} = Pr{A n B) + Pr{B n A} + Pr{A n B} + [Pr{A n B} - Pr{A n 5}] 
Rearrange this equation as follows: 

Pr{^ U B} = [Pr{A n B) + Pr{A n B}} + [Pr{B n A} + Pr{A n 5}] - Pr{yl n B} 
Pr{A UB} = Pr{A} + Pr{B} - Pr{A n 5} 
(b) In Fig. 6-13, event A is composed of regions 1,2, 3, and 6, event 5 is composed of 1, 3, 4, and 7, and 



Fig. 6-13 The union of three nonmutually exclus 
it C is composed of regions 1, 2, 4, and 5. 



i, AUBUC. 



The sample space in Fig. 6-13 is made up of 8 mutually exclusive regions. The eight regions are described 
as follows: region 1 is A n B n C , region 2 is A n C n 5, region 3 is 4 n 5 n C, region 4 is A n C n 5, region 
5 is i n C n 5, region 6 is A n C n -8, region 7 is ^ n C n 5, and region 8 is i n C n 5. 

The probability Pr{^4 UBUC} may be expressed as the probability of the 7 mutually exclusive 
regions that make up iUfiUCas follows: 

Pr{A n B n C} + Pr{yi ncnl} + Pr{^ n 5 n C} + Pr{i nCnB} 
+ Pr{A ncnl} + Pr{^ n C n 5} + Pr{i n C n B} 
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Each part of this equation may be re-expressed and the whole simplified to the result: 
Pr{A UBUC} = Pr{A} + Pr{B} + Pr{C} - Pr{^ n B} - Pr{B n C} - Pr{.4 n C} + Pr{A nBnC} 
For example, Pr{A n C n -S} is expressible as: 

Pr{C} - Pr{^ n C} - Pr{B n C} + ?r{A n £ H C} 



6.39 In a survey of 500 adults were asked the three-part question (1) Do you own a cell phone, (2) Do 
you own an ipod, and (3) Do you have an internet connection? The results of the survey were as 
follows (no one answered no to all three parts): 

cell phone 329 cell phone and ipod 

ipod 186 cell phone and internet connection 

internet connection 295 ipod and internet connection 

Give the probabilities of the following events: 

(a) answered yes to all three parts, (b) had a cell phone but not an internet connection, (c) had an 
ipod but not a cell phone, (d) had an internet connection but not an ipod, (<?) had a cell phone or 
an internet connection but not an ipod and, (J) had a cell phone but not an ipod or an internet 
connection. 

SOLUTION 

Event A is that the respondent had a cell phone, event B is that the respondent had an ipod, and event C 
is that the respondent had an internet connection. 



7 

ipod 

4 



internet connection 
Fig. 6-14 Venn diagram for Problem 6.39. 

(a) The probability that everyone is in the union is 1 since no one answered no to all three parts. 
Pr{AVJBjC} is given by the following expression: 

Pr{A} + Pr{B} + Pr{C} - Pr{.4 n B} - Pr{B nC}- Pr{^ n C} + Pr{^ n B n C} 

1 = 329/500 + 186/500 + 295/500 - 83/500 - 63/500 - 217/500 + Pr{^ nfiflC} 

Solving for Pr{^ DBnC}, we obtain 1 - 447/500 or 53/500 = 0.106. 

Before answering the other parts, it is convenient to fill in the regions in Fig. 6-14 as shown in Fig. 6-15. 
The number in region 2 is the number in region AdC minus the number in region 1 or 217 — 53 = 164. 
The number in region 3 is the number in region A n B minus the number in region 1 or 83 — 53 = 30. The 
number in region 4 is the number in region BnC minus the number in region 1 or 63 — 53 = 10. 
The number in region 5 is the number in region C minus the number in regions 1, 2, and 4 or 
295 - 53 - 164 - 10 = 68. The number in region 6 is the number in region A minus the number in 
regions 1, 2, and 3 or 329 - 53 - 164 - 30 = 82. The number in region 7 is 186 - 53 - 30 - 10 = 93. 
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internet connection 



Fig. 6-15 A, B, and C broken it 



Region 


number 


1 


53 


2 


164 


3 


30 


4 


10 


5 


68 


6 


82 


7 


93 


Total 


500 



mtually exclusive regions. 



(b) regions 3 and 6 or 30 + 82 = 1 12 and the probability is 1 12/500 = 0.224. 

(c) regions 4 and 7 or 10 + 93 = 103 and the probability is 103/500 = 0.206 

(d) regions 2 and 5 or 1 64 + 68 = 232 and the probability is 232/500 = 0.464. 

(e) regions 2, 5, or 6 or 164+68 + 82 = 314 and the probability is 314/500 = 0.628. 
(/) region 6 or 82 and the probability is 82/500 = 0. 1 64. 



Supplementary Problems 

FUNDAMENTAL RULES OF PROBABILITY 

6.40 Determine the probability p, or an estimate of it, for each of the following events: 

(a) A king, ace, jack of clubs, or queen of diamonds appears in drawing a single card from a well-shuffled 
ordinary deck of cards. 

(b) The sum 8 appears in a single toss of a pair of fair dice. 

(c) A nondefective bolt will be found if out of 600 bolts already examined, 12 were defective. 

(d) A 7 or 11 comes up in a single toss of a pair of fair dice. 

(e) At least one head appears in three tosses of a fair coin. 

6.41 An experiment consists of drawing three cards in succession from a well-shuffled ordinary deck of cards. 
Let E\ be the event "king" on the first draw, E 2 the event "king" on the second draw, and £ 3 the event 
"king" on the third draw. State in words the meaning of each of the following: 

(a) Pr{E\E 2 } (c) £, + E 2 (e) E^E 2 E 3 

(b) Pr{£-,+£ 2 } (d) PrLE^^} {f) Vl{E l E 2 + E 2 E i } 

6.42 A ball is drawn at random from a box containing 10 red, 30 white, 20 blue, and 15 orange marbles. Find the 
probability that the ball drawn is (a) orange or red, (b) not red or blue, (c ) not blue, (d ) white, and (e) red, 
white, or blue. 



6.43 Two marbles are drawn in succession from the box of Problem 6.42, replacement being made after each 
drawing. Find the probability that (a) both are white, (b) the first is red and the second is white, (c) neither is 
orange, (d ) they are either red or white or both (red and white), (e) the second is not blue, (/ ) the first is orange, 
(g) at least one is blue, (h) at most one is red, (;') the first is white but the second is not, and ( /') only one is red. 



CHAP. 6] 



ELEMENTARY PROBABILITY THEORY 



167 



6.44 Work Problem 6.43 if there is no replacement after each drawing. 

6.45 Find the probability of scoring a total of 7 points (a) once. (/;) at least once, and (c) twice in two tosses of a 
pair of fair dice. 

6.46 Two cards are drawn successively from an ordinary deck of 52 well-shuffled cards. Find the probability that 

(a) the first card is not a 10 of clubs or an ace, (b) the first card is an ace but the second is not, (c) at least one 
card is a diamond, (d) (he cards are not of the same suit, (e) not more than one card is a picture card (jack, 
queen, king), (J) the second card is not a picture card, (g) the second card is not a picture card given that the 
first was a picture card, and (h) the cards are picture cards or spades or both. 

6.47 A box contains 9 tickets numbered from 1 to 9 inclusive. If 3 tickets are drawn from the box one at a time, 
find the probability that they are alternately either (1) odd, even, odd or (2) even, odd, even. 

6.48 The odds in favor of A winning a game of chess against B are 3 : 2. If three games are to be played, what are 
the odds (a) in favor of A's winning at least two games out of the three and (b) against A losing the first 
two games to Bl 

6.49 A purse contains 2 silver coins and 4 copper coins, and a second purse contains 4 silver coins and 3 copper 
coins. If a coin is selected at random from one of the two purses, what is the probability that it is a silver 
coin? 

6.50 The probability that a man will be alive in 25 years is I, and the probability that his wife will be alive in 
25 years is \. Find the probability that (a) both will be alive, (b) only the man will be alive, (c) only the wife 
will be alive, and (d) at least one will be alive. 

6.51 Out of 800 families with 4 children each, what percentage would be expected to have (a) 2 boys and 2 girls, 

(b) at least 1 boy, (c) no girls, and (d) at most 2 girls? Assume equal probabilities for boys and girls. 



PROBABILITY DISTRIBUTIONS 



6.52 If X is the random variable showing the number of boys in families with 4 children (see Problem 6.51), 
(a) construct a table showing the probability distribution of X and (b) represent the distribution in part 
(a) graphically. 

6.53 A continuous random variable X that can assume values only between X = 2 and 8 inclusive has a density 
function given by a(X + 3), where a is a constant, (a) Calculate a. Find (b) Pr{3 < X < 5}, (c) Pr{X > 4}, 
and (d) Pr{|Y-5| < 0.5}. 

6.54 Three marbles are drawn without replacement from an urn containing 4 red and 6 white marbles. If X is a 
random variable that denotes the total number of red marbles drawn, (a) construct a table showing the 
probability distribution of X and (b) graph the distribution. 

6.55 (a) Three dice are rolled and Y=the sum on the three upturned faces. Give the probability distribution 
forZ. (b) Find Pr{7<Z<ll). 



MATHEMATICAL EXPECTATION 



6.56 What is a fair price to pay to enter a game in which one can win $25 with probability 0.2 and $10 with 
probability 0.4? 



168 



ELEMENTARY PROBABILITY THEORY 



[CHAP. 6 



6.57 If it rains, an umbrella salesman can earn $30 per day. If it is fair, he can lose $6 per day. What is his 
expectation if the probability of rain is 0.3? 

6.58 A and B play a game in which they toss a fair coin 3 times. The one obtaining heads first wins the game. If A 
tosses the coin first and if the total value of the stakes is $20, how much should be contributed by each in 
order that the game be considered fair? 

6.59 Find (a) E(X), (b) E(X 2 ), (c) E[(X - X)\ and (d) £(X 3 ) for the probability distribution of Table 6.4. 



Table 6.4 



X 


-10 -20 30 


P(X) 


1/5 3/10 1/2 



6.60 Referring to Problem 6.54, find the (a) mean, (b) variance, and (c) standard deviation of the distribution of 
X, and interpret your results. 

6.61 A random variable assumes the value 1 with probability p, and with probability q = 1 — p. Prove that 
(a) E(X) = p and (b) E[(X - X) 1 } = pq. 

6.62 Prove that (a) E(2X + 3) = 2E(X) + 3 and (b) E[(X - X) 2 } = E(X 2 ) - [E(X)f . 

6.63 In Problem 6.55, find the expected value of X. 



PERMUTATIONS 

6.64 Evaluate (a) 4 P 2 , (b) 7 P 5 , and (c) ]0 P 3 . Give the EXCEL function for parts (a), (b), and (c). 

6.65 For what value of n is n+l P 3 = „P 4 1 

6.66 In how many ways can 5 people be seated on a sofa if there are only 3 seats available? 

6.67 In how many ways can 7 books be arranged on a shelf if (a) any arrangement is possible, (b) 3 particular 
books must always stand together, and (c) 2 particular books must occupy the ends? 

6.68 How many numbers consisting of five different digits each can be made from the digits 1, 2, 3, . . . , 9 if («) the 
numbers must be odd and (b) the first two digits of each number are even? 

6.69 Solve Problem 6.68 if repetitions of the digits are allowed. 

6.70 How many different three-digit numbers can be made with three 4's, four 2's, and two 3's? 

6.71 In how many ways can 3 men and 3 women be seated at a round table if (a) no restriction is imposed, 
(b) 2 particular women must not sit together, and (c) each woman is to be between 2 men? 
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COMBINATIONS 



6.72 Evaluate (a) ^ J , (b) , and (c) ^ g J . Give the EXCEL function for parts (a), (b), and (c). 

6.73 For what value of n does 3 ^ = 7 G) ? 

6.74 In how many ways can 6 questions be selected out of 10? 

6.75 How many different committees of 3 men and 4 women can be formed from 8 men and 6 women? 

6.76 In how many ways can 2 men, 4 women, 3 boys, and 3 girls be selected from 6 men, 8 women, 4 boys, and 
5 girls if (a) no restrictions are imposed and (b) a particular man and woman must be selected? 

6.77 In how many ways can a group of 10 people be divided into (a) two groups consisting of 7 and 3 people and 
(b) three groups consisting of 4, 3, and 2 people? 

6.78 From 5 statisticians and 6 economists a committee consisting of 3 statisticians and 2 economists is to be 
formed. How many different committees can be formed if (a) no restrictions are imposed, (b) 2 particular 
statisticians must be on the committee, and (c) 1 particular economist cannot be on the committee? 

6.79 Find the number of (a) combinations and (b) permutations of four letters each that can be made from the 
letters of the word Tennessee! 

6.80 Prove that 1 - Q + Q - Q + • • • + (-1)" Q = 0. 



STIRLING'S APPROXIMATION TO n! 



6.81 In how many ways can 30 individuals be selected out of 100? 

6.82 Show that = 2 2 "/ v / ™ approximately, for large values of n. 



MISCELLANEOUS PROBLEMS 



6.83 Three cards are drawn from a deck of 52 cards. Find the probability that (a) two are jacks and one is a king, 
(b) all cards are of one suit, (c) all cards are of different suits, and (d) at least two aces are drawn. 

6.84 Find the probability of at least two 7's in four tosses of a pair of dice. 

6.85 If 10% of the rivets produced by a machine are defective, what is the probability that out of 5 rivets chosen 
at random (a) none will be defective, (b) 1 will be defective, and (c) at least 2 will be defective? 

6.86 (a) Set up a sample space for the outcomes of 2 tosses of a fair coin, using 1 to represent "heads" and to 

represent "tails." 

(b) From this sample space determine the probability of at least one head. 

(c) Can you set up a sample space for the outcomes of 3 tosses of a coin? If so, determine with the aid of it 
the probability of at most two heads? 



ELEMENTARY PROBABILITY THEORY 



6.87 A sample poll of 200 voters revealed the following information concerning three candidates (A, B, and C) of 
a certain party who were running for three different offices: 

28 in favor of both A and B 122 in favor of B or C but not A 

98 in favor of A or B but not C 64 in favor of C but not A or B 
42 in favor of B but not A or C 14 in favor of A and C but not B 

How many of the voters were in favor of (a) all three candidates, (b) A irrespective of B or C, (c) B 
irrespective of A or C, (d) C irrespective of A or B, (<?) A and B but not C, and (/) only one of the 
candidates? 

6.88 (a) Prove that for any events £, and E 2 , Pr{£, + £ 2 } < Pr{£, } + Pr{£ 2 }. 
(Z>) Generalize the result in part (a). 

6.89 Let E x , E 2 , and E 3 be three different events, at least one of which is known to have occurred. Suppose that 
any of these events can result in another event A, which is also known to have occurred. If all the prob- 
abilities Pr{£ t }, Pr{£ 2 }, Pr{£ 3 } and Pr{^|£,}, Pr{A\E 2 }, Pr{^|£ 3 } are assumed known, prove that 

Pr / £l \ A} = PrfMPr^lM 

1 11 ' Pr{£,} Pr{A\E{} + Pr{E 2 } Pr{^|£ 2 } + Pr{£ 3 } Pr{A\E 3 } 

with similar results for Pr{£ 2 |^} and Pr{£ 3 |^4}. This is known as Bayes' rule or theorem. It is useful in 
computing probabilities of various hypotheses E u E 2 , or E 3 that have resulted in the event A. The result can 
be generalized. 

6.90 Each of three identical jewelry boxes has two drawers. In each drawer of the first box there is a gold watch. 
In each drawer of the second box there is a silver watch. In one drawer of the third box there is a gold watch, 
while in the other drawer there is a silver watch. If we select a box at random, open one of the drawers, 
and find it to contain a silver watch, what is the probability that the other drawer has the gold watch? 
[Hint: Apply Problem 6.89.] 

6.91 Find the probability of winning a state lottery in which one is required to choose six of the numbers 1, 2, 
3, . . . , 40 in any order. 

6.92 Work Problem 6.91 if one is required to choose (a) five, (h) four, and (c) three of the numbers. 

6.93 In the game of poker, five cards from a standard deck of 52 cards are dealt to each player. Determine the 
odds against the player receiving: 

(a) A royal flush (the ace, king, queen, jack, and 10 of the same suit) 

(b) A straight flush (any five cards in sequence and of the same suit, such as the 3, 4, 5, 6, and 7 of spades) 

(c) Four of a kind (such as four 7's) 

(d) A full house (three of one kind and two of another, such as three kings and two 10's) 

6.94 A and B decide to meet between 3 and 4 p.m. but agree that each should wait no longer than 10 minutes for 
the other. Determine the probability that they meet. 

6.95 Two points are chosen at random on a line segment whose length is a > 0. Find the probability that the three 
line segments thus formed can be the sides of a triangle. 

6.96 A regular tetrahedron consists of four sides. Each is equally likely to be the one facing down when the 
tetrahedron is tossed and comes to rest. The numbers 1, 2, 3, and 4 are on the four faces. Three regular 
tetrahedrons are tossed upon a table. Let X= the sum of the three faces that are facing down. Give the 
probability distribution for X. 
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6.97 In Problem 6.96, find the expected value of X. 

6.98 In a survey of a group of people it was found that 25% were smokers and drinkers, 10% were smokers but 
not drinkers, and 35% were drinkers but not smokers. What percent in the survey were either smokers or 
drinkers or both? 

6.99 Acme electronics manufactures MP3 players at three locations. The plant at Omaha manufactures 50% 
of the MP3 players and 1% are defective. The plant at Memphis manufactures 30% and 2% at that plant 
are defective. The plant at Fort Meyers manufactures 20% and 3% at that plant are defective. If an MP3 
player is selected at random, what is the probability that it is defective? 



6.100 



Refer to Problem 6.99. An MP3 player is found to be defective. What is the probability that it was 
manufactured in Fort Meyers? 



The Binomial, 
Normal, and Poisson 
Distributions 



THE BINOMIAL DISTRIBUTION 

If p is the probability that an event will happen in any single trial (called the probability of a success) 
and q = 1 — p is the probability that it will fail to happen in any single trial (called the probability of a 
failure), then the probability that the event will happen exactly X times in N trials (i.e., X successes and 
N - X failures will occur) is given by 



where X = 0, 1, 2, . . . , N; N\ = N(N - l)(N - 2) ■ ■ ■ 1; and 0! = 1 by definition (see Problem 6.34). 
EXAMPLE 1. The probability of getting exactly 2 heads in 6 tosses of a fair coin is 



using formula (7) with N = 6, X = 2, and p = q = \. 

Using EXCEL, the evaluation of the probability of 2 heads in 6 tosses is given by the following: 
=BINOMDIST(2,6,0.5,0), where the function BINOMD1ST has 4 parameters. 

The first parameter is the number of successes, the second is the number of trials, the third is the probability of 
success, and the fourth is a or 1. A zero gives the probability of the number of successes and a 1 gives the 
cumulative probability. The function =BINOMDIST(2,6,0.5,0) gives 0.234375 which is the same as 15/64. 

EXAMPLE 2. The probability of getting at least 4 heads in 6 tosses of a fair coin is 





}2 
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The discrete probability distribution (7) is often called the binomial distribution since for X = 0, 1, 
2, . . . , N it corresponds to successive terms of the binomial formula, or binomial expansion, 

( q +pf = q N + [ly-'p + ( N 2 y-y +•••+/ 

where 1 , (^) , (^) , ■ ■ • are called the binomial coefficients. 

Using EXCEL, the solution is = 1-BINOMDIST(3,6,0.5,1) or 0.34375 which is the same as 11/32. 
Since Pr{X>4}= 1 - Pr{X<3}, and BINOMDIST(3,6,0.5,l) = Pr{X< 3}, this computation will give 
the probability of at least 4 heads. 

EXAMPLE 3. (q + p) 4 = q 4 + Q q 3 p + Q q 2 p 2 + Q qp 3 + p 4 

= q 4 + 4q 3 p + 6q 2 p 2 + Aqp 3 + p 4 
Some properties of the binomial distribution are listed in Table 7.1. 



Table 7.1 Binomial Distribution 



Mean 


fi = Np 


Variance 




Standard deviation 


a= \J~Npq 


Moment coefficient of skewness 


&i = <i-p 

s/Npq 


Moment coefficient of kurtosis 


1 - 6pq 

a 4 = 3 H 

Npq 



EXAMPLE 4. In 100 tosses of a fair coin the mean number of heads is = Np = (100)0 = 50; this is the expected 
number of heads in 100 tosses of the coin. The standard deviation is a = \/Npq = ^/(100)(j @ = 5. 



THE NORMAL DISTRIBUTION 

One of the most important examples of a continuous probability distribution is the normal distribu- 
tion, normal curve, or gaussian distribution. It is defined by the equation 

where \x = mean, a = standard deviation, tt = 3.14159 • • •, and e = 2.71828 • • •. The total area bounded 
by curve (3) and the X axis is 1; hence the area under the curve between two ordinates X = a and X = b, 
where a < b, represents the probability that X lies between a and b. This probability is denoted by 
Pr{,3 < X < b}. 

When the variable X is expressed in terms of standard units [z = (X — ^i)/cr], equation (J) is replaced 
by the so-called standard form: 
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In such cases we say that z is normally distributed with mean and variance 1. Figure 7-1 is a graph of this 
standardized normal curve. It shows that the areas included between z = — 1 and +1, z = —2 and +2, 
and z — -3 and +3 are equal, respectively, to 68.27%, 95.45%, and 99.73% of the total area, which is 1. 
The table in Appendix II shows the areas under this curve bounded by the ordinates at z = and any 
positive value of z. From this table the area between any two ordinates can be found by using the 
symmetry of the curve about z = 0. 

Some properties of the normal distribution given by equation (3) are listed in Table 7.2. 




-3-2-10123 
Fig. 7-1 Standard normal curve: 68.27% of the area is between z=— 1 and z= 1, 95.45% is between z = — 2 and 
z = 2, and 99.73% is between z = -3 and z = 3. 



Table 7.2 Normal Distribution 



Mean 




Variance 




Standard deviation 




Moment coefficient of skewness 


Q 3 =0 


Moment coefficient of kurtosis 




Mean deviation 


GsJlfK = 0.7979a 



RELATION BETWEEN THE BINOMIAL AND NORMAL DISTRIBUTIONS 

If N is large and if neither p nor q is too close to zero, the binomial distribution can be closely 
approximated by a normal distribution with standardized variable given by 
X -Np 

The approximation becomes better with increasing N, and in the limiting case it is exact; this is shown in 
Tables 7.1 and 7.2, where it is clear that as N increases, the skewness and kurtosis for the binomial 
distribution approach that of the normal distribution. In practice the approximation is very good if both 
Np and Nq are greater than 5. 
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EXAMPLE 5. Figure 7-2 shows the binomial distribution with N= 16 and p = 0.5, reflecting the probabilities of 
getting X heads in 16 flips of a coin and the normal distribution with mean equal to 8 and standard deviation 2. 
Notice how similar and close the two are. X is binomial with mean = JV/> = 16(0.5) = 8 and standard deviation 
yHVpq = ^16(0.5X0.5) = 2. Y is a normal curve with mean = 8 and standard deviation 2. 



Fig. 7-2 Plot of binomial for N= 16 and p = 0.5 and normal curve with mean = 8 and standard deviation = 2. 



THE POISSON DISTRIBUTION 

The discrete probability distribution 

p(X)=^- X = 0,1,2,... (J) 

where e = 2.71828 • • • and A is a given constant, is called the Poisson distribution after Simeon-Denis 
Poisson, who discovered it in the early part of the nineteenth century. The values of p(X) can be 
computed by using the table in Appendix VIII (which gives values of e~ A for various values of A) or 
by using logarithms. 

EXAMPLE 6. The number of admissions per day at an emergency room has a Poisson distribution and the mean 
is 5. Find the probability of at most 3 admissions per day and the probability of at least 8 admissions per day. 
The probability of at most 3 is Pr{Z< 3} = ^ 5 {5°/0! + 5'/l! + 5 2 /2! + 5 3 /3!}. From Appendix VIII, e~ 5 = 0.006738, 
and Pr{X< 3} = 0.006738(1 + 5+ 12.5 + 20.8333} = 0.265. Using MINITAB, the pull-down "Calc => Probability 
distribution => Poisson" gives the Poisson distribution dialog box which is filled out as shown in Fig. 7-3. 
The following output is given: 

Cumulative Distribution Function 

Poisson with mean = 5 
x P(X< = x) 
3 0.265026 

The same answer as found using Appendix VIII is given. 
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The probability of at least 8 admissions is Pr{X> 8} = 1 - Pr{X< 7}. Using MINITAB, we find: 




Fig. 7-3 MINITAB 



distribution dialog box. 



Cumulative Distribution Function 

Poisson with mean = 5 
x P(X< = x) 
7 0.866628 



Pr{X> 8} = 1 - 0.867 = 0.133. 
Some properties of the Poisson distribution 



listed in Table 7.3. 
Table 7.3 Poisson's Distribution 



Mean 




Variance 




Standard deviation 


a=V\ 


Moment coefficient of skewness 




Moment coefficient of kurtosis 


c* 4 = 3+ 1/A 



RELATION BETWEEN THE BINOMIAL AND POISSON DISTRIBUTIONS 

In the binomial distribution (7), if TV is large while the probability p of the occurrence of an event is 
close to 0, so that q = 1 — p is close to 1 , the event is called a rare event. In practice we shall consider an 
event to be rare if the number of trials is at least 50 {N > 50) while Np is less than 5. In such case 
the binomial distribution (7) is very closely approximated by the Poisson distribution (5) with A = Np. 
This is indicated by comparing Tables 7.1 and 7.3 — for by placing A = Np, q ss 1, and p « in Table 7.1, 
we get the results in Table 7.3. 

Since there is a relation between the binomial and normal distributions, it follows that there also is 
a relation between the Poisson and normal distributions. It can in fact be shown that the Poisson distribu- 
tion approaches a normal distribution with standardized variable (X - as A increases indefinitely. 
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THE MULTINOMIAL DISTRIBUTION 

If events E\, E 2 , . . . , E K can occur with probabilities p l ,p 2 ,... ,Pk, respectively, then the probability 
that Ei, E 2 , ■ ■ ■ ,E K will occur X\, X 2 , . . . ,X K times, respectively, is 



where X { + X 2 + ■ ■ ■ + X K — N. This distribution, which is a generalization of the binomial distribution, 
is called the nuiliinoniial distribution since equation (6) is the general term in the multinomial expansion 
(Pi+P2 + ---+Pk) N - 

EXAMPLE 7. If a fair die is tossed 12 times, the probability of getting 1, 2, 3, 4, 5, and 6 points exactly twice each is 

1925 



559,872 " 00344 



The expected numbers of times that E x , E 2 ,...,E K will occur in N trials are Np x , Np 2 , Np K , respectively. 
EXCEL can be used to find the answer as follows. =MULTINOMlAL(2,2.2.2,2,2) is used to evaluate 2!2 ,jg, 2!2 , 
which gives 7,484,400. This is then divided by 6 12 which is 2,176,782,336. The quotient is 0.00344. 



FITTING THEORETICAL DISTRIBUTIONS TO SAMPLE FREQUENCY DISTRIBUTIONS 

When one has some indication of the distribution of a population by probabilistic reasoning or 
otherwise, it is often possible to fit such theoretical distributions (also called model or expected distribu- 
tions) to frequency distributions obtained from a sample of the population. The method used consists in 
general of employing the mean and standard deviation of the sample to estimate the mean and sample of 
the population (see Problems 7.31, 7.33, and 7.34). 

In order to test the goodness of fit of the theoretical distributions, we use the chi-square test (which 
is given in Chapter 12). In attempting to determine whether a normal distribution represents a good fit 
for given data, it is convenient to use normal-curve graph paper, or probability <;rapli paper as it is 
sometimes called (see Problem 7.32). 



Solved Problems 



THE BINOMIAL DISTRIBUTION 

7.1 Evaluate the following: 

(a) 51 (c) Q (e) Q 

<*> 2T4! ^ Q if) Q 

SOLUTION 

(a) 5! = 5 - 4 - 3 • 2 - 1 = 120 

6! = 6-5-4-3-2-1 = 6-5 
V) 2!4! (2 - 1)(4 -3-2-1) 2-1 



8! 



3! (8 -3)! 3!5! (3 - 2 - 1)(5 -4-3-2-1) 3-2- 
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7! _ 7-6-5-4-3-2-1 _7-6_ 
= 5!2!~ (5 -4- 3 -2 - 1)(2- 1) ~2 _ T~ 



:e 0! = 1 by definition 



7.2 Suppose that 15% of the population is left-handed. Find the probability that in group 
of 50 individuals there will be (a) at most 10 left-handers, (b) at least 5 left-handers, 
(c) between 3 and 6 left-handers inclusive, and (d) exactly 5 left-handers. Use EXCEL to find 
the solutions. 

SOLUTION 

(a) The EXCEL expression =BINOMDIST(10,50,0.15,1) gives Pr{X< 10} or 0.8801. 

(b) We are asked to find Pr{X>5} which equals l-Pr{X<4} since Z>5 and X<4 are comple- 
mentary events. The EXCEL expression that gives the desired result is =1-BINOMDIST(4,50,0.15,1) 
or 0.8879. 

(c) We are asked to find Pr{3 < X< 6} which equals Pr{X< 6} - Pr{Z< 2}. The EXCEL expression to give 
the result is =BINOMDIST(6,50,0.15,1)-BINOMDIST(2,50,0.15,1) or 0.3471. 

(d) The EXCEL expression =BINOMDIST(5,50,0.15,0) gives Pr{Z= 5} or 0.1072. 



7.3 Find the probability that in five tosses of a fair die a 3 appears (a) at no time, (b) once, (c) twice, 
(d) three times, (e) four times, (f) five times, and (g) give the MIN1TAB solution. 
SOLUTION 

The probability of 3 in a single toss = p = g, and the probability of no 3 in a single toss = q= l— p = ^ 

thus: 

« — >-(9W-<'X'>(3'-£i 

<» Pl{ 3oeeu«„«, im ^Q(i)'(|) 4 H 5 )(i)(|) < ^ 

W Pr(3 occurs three ,,„«,= (j) Q' (.0) (^) (g) 
(«, Pr,3 occurs four ,,„«; = Q Q) 4 (5) '= (5)^) (|) = JL 
(/) P,{3 occurs five times)- Q (i) S (J)°. (l)(^j)(D - ^ 
Note that these probabilities represent the terms in the binomial expansion 

(H - !+ (0(D*G) + 00'0 i+ ©¥ + G) 4+ 1 

(g) enter the integers through 5 in column CI and then fill in the dialog box for the binomial distribution 
as in Fig. 7-4. 
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P Probability 

<~ Cumulative probability 

r Inverse cumulative probability 

Number of trials: fs 
Probability of success: jo 166 



<* Input column: 
Optional storage: 

C Input constant: 
Optional storage: 



[cT- 



Help 



Fig. 7-4 MINITAB dialog box for Problem 7.3 (g). 



The following output is produced in the worksheet: 

Cl C2 

0.401894 

1 0.401874 

2 0.160742 

3 0.032147 

4 0.003215 

5 0.000129 

Show that the fractions given in parts (a) through (/) convert to the decimals that 
MINITAB gives. 

7.4 Write the binomial expansion for (a)(q + p) 4 and (b)(q + pf . 
SOLUTION 

(a) (q + p) 4 = q 4 + ( <7 3 ;> + Q <7V + (0 + p 4 

= q 4 + 4q 3 p + 6q 2 p 2 + Aqp 1 + p 4 

(b) (q+p) 6 = q 6 + Q q s p + Q q 4 p 2 + Q <?V 3 + Q <?V 4 + Q + p 6 

= q 6 + 6q 5 p + I5q 4 p 2 + 20q 3 p 3 + I5q 2 p 4 + 6qp 5 + p 6 

The coefficients 1, 4, 6, 4, 1 and 1,6, 15, 20, 1 5, 6. 1 are called the binomial coefficients corresponding to 
N = 4 and N = 6, respectively. By writing these coefficients for N = 0, 1, 2, 3, . . . , as shown in the following 
array, we obtain an arrangement called Pascal's triangle. Note that the first and last numbers in each row are 
1 and that any other number can be obtained by adding the two numbers to the right and left of it in the 
preceding row. 
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1 1 

1 2 1 
13 3 1 
1 4 6 4 1 
1 5 10 10 5 1 
1 6 15 20 15 6 1 

7.5 Find the probability that in a family of 4 children there will be (a) at least 1 boy and (b) at least 
1 boy and 1 girl. Assume that the probability of a male birth is ±. 

SOLUTION 

Thus Pr{at least 1 boy} = Pr{l boy} + Pr{2 boys} + Pr{3 boys} + Pr{4 boys} 

_1 3 1 J__15 
~4 + 8 + 4 + 16~ 16 

Another method 



1 boy} = 



- Prlno boy} = 



16 16 

(b) Pr{at least 1 boy and 1 girl} = 1 - Pr{no boy} - Pr{no girl} = 1 - - 



7.6 Out of 2000 families with 4 children each, how many would you expect to have (a) at least 1 boy, 
(b) 2 boys, (c) 1 or 2 girls, and (d) no girls? Refer to Problem 7.5(a). 



SOLUTION 

(a) Expected number of families with at least 1 boy = 2000({f) = 1875 

(*) Expected number of families with 2 boys = 2000 • Pr{2 boys} = 2000(f) = 750 

(c) Pr{ 1 or 2 girls} = Pr{ 1 girl} + Pr{2 girls} = Pr{ 1 boy} + Pr{2 boys} = \ + | = §. Expected number of 
families with 1 or 2 girls = 2000(f) = 1250 

(d) Expected number of families with no girls = 2000(^) = 125 



7.7 If 20% of the bolts produced by a machine are defective, determine the probability that, out of 
4 bolts chosen at random, (a) 1, (b) 0, and (c) at most 2 bolts will be defective. 

SOLUTION 

The probability of a defective bolt is p = 0.2, and the probability of a nondefective bolt is 
q = 1 -p = 0.8. 

(a) Pr{l defective bolt out of 4} = H j (0.2)'(0.8) 3 = 0.4096 
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(b) 
to 



Pr{0 defective bolts} = Q (0.2)°(0.8) 4 = 0.4096 
Pr{2 defective bolts} = Q (0.2) 2 (0.8) 2 = 0.1536 



Thus 

Pr{at most 2 defective bolts} = Pr{0 defective bolts} + Pr{l defective bolt} + Pr{2 defective bolts} 
= 0.4096 + 0.4096 + 0.1536 = 0.9728 

7.8 The probability that an entering student will graduate is 0.4. Determine the probability that out of 
5 students, (a) none will graduate, (b) 1 will graduate, (c) at least 1 will graduate, (d) all will 
graduate, and (e) use STATISTIX to answer parts (a) through (d). 



{a) Pr{none will graduate} = Q (0.4)° (0.6) 5 = 0.07776 or about 0.08 

{b) Pr{l will graduate} = Q (0.4) 1 (0.6) 4 = 0.2592 or about 0.26 

(c) Pr{at least 1 will graduate} = 1 - Pr{none will graduate} = 0.92224 or about 0.92 

(d) Pr{all will graduate} = ^j(0.4) 5 (0.6)° = 0.01024 or about 0.01 

(e) STATISTIX evaluates only the cumulative binomial. The dialog box in Fig. 7-5 gives the cumulative 
binomial distribution for N=5, p = 0.4, q = 0.6, and x = 0, 1, 4, and 5. 



Probabiliy Functic 



Function— 



T Beta [it, a, b] 

t? Binomial (x, n, p) 

C Chi-square (x, df) 

£* Correlation (m, n) 

r F (x, dfnum, dlden] 

C F Inverse (p. dlnum. dfden] 

T Hj>pergeo(x1,x2,n1.n2] 

T Neg-Binomial [n+x, n, p] 

C Poisson (xjarnbda) 

r T1-Tai(«,df] 

r T2-Tai(«,df) 

C T Inverse (p, df) 

C Z1-TailM 
-Ta,l 



Help 



[oT" 



Results 



Binomial [0, 5, 0.4) = 0.07776 
Binomial [1,5, 0.4) = 0.33696 
RnnmN [i fj. fl 4] = I I WF. 



Fig. 7-5 STATISTIX dialog box for Problem 7.8 (e). 
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Referring to the dialog box, the probability none will graduate is Pr{X=0} = Binomial(0,5,0.4) = 0.07776. 
The probability 1 will graduate is Pr{X= 1} = Pr{X< 1} - Pr{JT< 0} = Binomial(l, 5,0.4) - Binomial 
(0,5,0.4) = 0.33696 - 0.07776 = 0.2592. The probability at least 1 will graduate is Pr{Z> 1} = 1 - Pr{Z= 0} = 
1-Binomial(0,5,0.4) = 0.92224. The probability all will graduate is Pr{X= 5} = Pr{Z< 5} - Pr{JT< 4} = 
Binomial(5,5,0.4)-Binomial(4,5,0.4)= 1.00000 -0.98976 = 0.01024. Note that STATISTIX gives only 
the cumulative binomial and some textbook tables also give only cumulative binomial probabilities. 

7.9 What is the probability of getting a total of 9 (a) twice and (b) at least twice in 6 tosses of a pair 
of dice? 
SOLUTION 

Each of the 6 ways in which the first die can fall can be associated with each of the 6 ways in which the 
second die can fall; thus there are 6 • 6 = 36 ways in which both dice can fall. These are: 1 on the first die and 
1 on the second die, 1 on the first die and 2 on the second die, etc., denoted by (1, 1), (1, 2), etc. 

Of these 36 ways (all equally likely if the dice are fair), a total of 9 occurs in 4 cases: (3, 6), (4, 5), (5, 4), 
and (6, 3). Thus the probability of a total of 9 in a single toss of a pair of dice is p = ^ = ^, and the 
probability of not getting a total of 9 in a single toss of a pair of dice is q = 1 - p = |. 

(a) Pr{2 mnes m 6 tosses} = ^ ^ ^ = 

(b) Pr{at least 2 nines} = Pr{2 nines} + Pr{3 nines} + Pr{4 nines} + Pr{5 nines} + Pr{6 nines} 

_ 61,440 10,240 960 48 1 _ 72,689 

~ 531,441 + 531,441 + 531,441 + 531,441 + 531,441 ~ 531,441 

Another method 



Pr{at least 2 nines} = 1 - Pr{0 nines} - Pr{l nine} 
= 1 - 



-G)G)'G)-(!)W-£ 



531,441 

7.10 Evaluate (a) Y,x=o X P( X ) and (*) E^=o x2 f( x )> where f( x ) = 
SOLUTION 

(a) Since g+p=h 

= Np(q + pf- 1 = Np 



^(x-iy.iN-xy. 1 

= N{N ~\ )p 2 + Np 
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Note: The results in parts (a) and (b) are the expectations of X and X 2 , denoted by E(X) and E(X 2 ), 
respectively (see Chapter 6). 

7.11 If a variable is binomially distributed, determine its (a) mean y, and (b) variance a 2 . 
SOLUTION 

(a) By Problem 7.10(a), 

/i = expectation of variable = ^^Xp(X) = Np 

(b) Using y = Np and the results of Problem 7.10, 

cr 2 = it(X- ri 2 P (X) = J2(X 2 - 2yX + n 2 )p(X) = X 2 p(X) -2yJ2 W) + S f>W 

= 7V(JV - 1)/ +Np - 2{Np){Np) + {Npf{l) = Np - iVp 2 = Np(l - p) = Npq 
It follows that the standard deviation of a binomially distributed variable is a = \JNpq. 
Another method 

By Problem 6.62(6), 

E[(X - X)] 2 = E(X 2 ) - [E(X)] 2 = N(N - \)p 2 + Np - N 2 p 2 = Np - Np 2 = Npq 

7.12 If the probability of a defective bolt is 0. 1 , find (a) the mean and (b) the standard deviation for the 
distribution of defective bolts in a total of 400. 

SOLUTION 

(a) The mean is Np = 400(0.1) = 40; that is, we can expect 40 bolts to be defective. 

(b) The variance is Npq = 400(0.1)(0.9) = 36. Hence the standard deviation is \/36 = 6. 

7.13 Find the moment coefficients of (a) skewness and (b) kurtosis of the distribution in Problem 7.12. 
SOLUTION 

(a) Moment coefficient of skewness = — = = — =0.133 



Since this is positive, the distribution is skewed to the right. 

(b) Moment coefficient of kurtosis = 3 + i-^ = 3 + 1 = 6( °J )( °- 9) = 3.01 

Npq 36 

The distribution is slightly Icpiokurtic with respect to the normal distribution (i.e., slightly more peaked; 
see Chapter 5). 

THE NORMAL DISTRIBUTION 

7.14 On a final examination in mathematics, the mean was 72 and the standard deviation was 15. 
Determine the standard scores (i.e., grades in standard-deviation units) of students receiving 
the grades (a) 60, (b) 93, and (c) 72. 

SOLUTION 

X-X 60- 72 „ „ . . I-I 72-72 „ 
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7.15 Referring to Problem 7.14, find the grades corresponding to the standard scores (a) -1 and 
(b) 1.6. 

SOLUTION 

(a) X = X + zs = 72 + (-1)(15) = 57 (b) X = X + zs = 72 + (1.6)(15) = 96 



7.16 Suppose the number of games in which major league baseball players play during their careers is 
normally distributed with mean equal to 1500 games and standard deviation equal to 350 games. 
Use EXCEL to solve the following problems, (a) What percentage play in fewer than 750 games? 
(b) What percentage play in more than 2000 games? (c) Find the 90th percentile for the number of 
games played during a career. 

SOLUTION 

(a) The EXCEL statement =NORMDIST(750, 1500,350, 1) requests the area to the left of 750 for a 
normal curve having mean equal to 1500 and standard deviation equal to 350. The answer is 
Pr{X< 1500} = 0.0161 or 1.61% play in less than 750 games. 

(b) The EXCEL statement =1-NORMDIST(2000, 1500,350, 1) requests the area to the right of 2000 for a 
normal curve having mean equal to 1500 and standard deviation equal to 350. The answer is 
Vr{X> 2000} = 0.0766 or 7.66 % play in more than 2000 games. 

(c) The EXCEL statement =NORMINV(0.9, 1500, 350) requests the value on the horizontal axis which is 
such that there is 90% of the area to its left under the normal curve having mean equal to 1500 and 
standard deviation equal to 350. In the notation of Chapter 3, P 90 = 1948.5. 



7.17 Find the area under the standard normal curve in the following case: 

(a) Between z = 0.81 and z = 1.94. 

(b) To the right of z = -1.28. 

(c) To the right of z = 2.05 or to the left of z = - 1 .44. 

Use Appendix II as well as EXCEL to solve (a) through (c) [see Figs. 7-6 (a), (b), and (c)]. 
SOLUTION 

(a) In Appendix II go down the z column until you reach 1.9; then proceed right to the column marked 4. 
The result 0.4738 is Pr{0<z< 1.94}. Next, go down the z column until you reach 0.8; then proceed 
right to the column marked 1. The result 0.2910 is Pr{0 < z < 0.81}. The area Pr{0.81 <z< 1.94} is the 
difference of the two, Pr{0<z< 1.94} - Pr{0<z<0.81} = 0.4738-0.2910 = 0.1828. Using EXCEL, 
the answer is =NORMSDIST(1.94)-NORMSDIST(0.81) = 0.1828. When EXCEL is used, the area 
Pr{0.81 <z< 1.94} is the difference Pr{ -infinity < z < 1.94} - Pr{ -infinity < z < 0.81}. Note that the 
table in Appendix II gives areas from to a positive z-value and EXCEL gives the area from -infinity 
to the same z-value. 

(b) The area to the right of z= -1.28 is the same as the area to the left of z= 1.28. Using Appendix II, the 
area to the left of z= 1.28 is Pr{z<0} + Pr{0<z< 1.28} or 0.5 + 0.3997 = 0.8997. Using EXCEL, 
Pr{z>-1.28} = Pr{z< 1.28} and Pr{z< 1.28} is given by =NORMSDIST(1.28) or 0.8997. 

(c) Using Appendix II, the area to the right of 2.05 is 0.5 - Pr{z< 2.05} or 0.5 - 0.4798 = 0.0202. The area 
to the left of -1.44 is the same as the area to the right of 1.44. The area to the right of 1.44 is 
0.5 - Pr{z< 1.44} = 0.5 - 0.4251 = 0.0749. The two tail areas are equal to 0.0202 + 0.0749 = 0.0951. 
Using EXCEL, the area is given by the following: =NORMSDIST(-1.44) + (1 - NORMSDIST(2.05)) 
or 0.0951. 

Note in EXCEL, =NORMSDIST(z) gives the area under the standard normal curve to the left of z 
whereas =NORMDIST(z, u, a, 1) gives the area under the normal curve having mean u and standard 
deviation a to the left of z. 
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7.18 The time spent watching TV per week by middle-school students has a normal distribution with 
mean 20.5 hours and standard deviation 5.5 hours. Use MINITAB to find the percent who watch 
less than 25 hours per week. Use MINITAB to find the percent who watch over 30 hours 
per week. Sketch a curve representing these two groups. 

SOLUTION 

Figure 7-7 illustrates the solution. 




x=25 x=30 

Fig. 7-7 MINITAB plot of group who watch less than 25 and group who watch more than 
30 hours per week of TV. 

The pull down "Calc => Probability distributions => Normal" gives the normal distribution dialog box as 
in Fig. 7-8. 



Normal Distribution 








C Probability density 






f* Cumulative probability 






r Inverse cumulative probability 






Mean: 1 20 .5 






Standard deviation: |s . s 

f Input column: 

Optional storage: [ 
P Input constant: |2S| 




Optional storage: [ 


Select 1 




Help | 


OK Cancel | 







Fig. 7-8 Normal distribution MINITAB dialog box. 
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When the dialog box is completed as shown in Fig. 7.8 and executed, the following output results: 
Cumulative Distribution Function 

Normal with mean = 20.5 and standard deviation = 5.5 
x P(X< = x) 
25 0.793373 

79.3% of the middle-school students watch 25 hours or less per week. 
When 30 is entered as the input constant, the following output results: 

Cumulative Distribution Function 

Normal with mean = 20.5 and standard deviation = 5.5 
x P(X< = x) 
30 0.957941 

The percent who watch more than 30 hours per week is 1 - 0.958 = 0.042 or 4.2%. 

7.19 Find the ordinates of the normal curve at (a) z = 0.84, (b) z = -1.27, and (c) z = -0.05. 
SOLUTION 

(a) In Appendix I, proceed downward under the column headed z until reaching the entry 0.8; then proceed 
right to the column headed 4. The entry 0.2803 is the required ordinate. 

(b) By symmetry: (ordinate at z = - 1 .27) = (ordinate at z = 1.27)= 0.1781. 

(c) (Ordinate at z = -0.05) = (ordinate at z = 0.05) = 0.3984. 

7.20 Use EXCEL to evaluate some suitably chosen ordinates of the normal curve with mean equal 13.5 
and standard deviation 2.5. Then use the chart wizard to plot the resulting points. This plot 
represents the normal distribution for variable X, where X represents the time spent on the 
internet per week for college students. 

SOLUTION 

Abscissas are chosen from 6 to 21 at 0.5 intervals and entered into the EXCEL worksheet in A1A31. The 
expression =NORMD1ST(A1,13.5,3.5,0) is entered into cell Bl and a click-and-drag is performed. These are 
points on the normal curve with mean equal to 13.5 and standard deviation 3.5: 

6 0.001773 
6.5 0.003166 

7 0.005433 
7.5 0.008958 

8 0.01419 
8.5 0.021596 

9 0.03158 
9.5 0.044368 

10 0.059891 
10.5 0.077674 

11 0.096788 
11.5 0.115877 

12 0.13329 
12.5 0.147308 

13 0.156417 
13.5 0.159577 
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14 0.156417 
14.5 0.147308 

15 0.13329 
15.5 0.115877 

16 0.096788 
16.5 0.077674 

17 0.059891 
17.5 0.044368 

18 0.03158 
18.5 0.021596 

19 0.01419 
19.5 0.008958 

20 0.005433 
20.5 0.003166 

21 0.001773 

The chart wizard is used to plot these points. The result is shown in Fig. 7-9. 



0.18 
0.16 
0.14 
0.12 
_ 0.1 
0.08 
0.06 
0.04 
0.02 


3.5 5.5 7.5 9.5 11.5 13.5 15.5 17.5 19.5 21.5 23.5 
X 

Fig. 7-9 The EXCEL plot of the normal curve having mean= 13.5 and standard deviation = 2.5. 




7.21 Determine the second quartile (Q 2 ), the third quartile (g 3 ), and the 90th percentile (P 90 ) for 
the times that college students spend on the internet. Use the normal distribution given in 
Problem 7.20. 

SOLUTION 

The second quartile or 50th percentile for the normal distribution will occur at the center of the curve. 
Because of the symmetry of the distribution it will occur at the same point as the mean. In the case of the 
internet usage, that will be 13.5 hours per week. The EXCEL function =NORMINV(0.5, 13.5,2.5) is used to 
find the 50th percentile. The 50th percentile means that 0.5 of the area is to the left of the second quartile, 
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and the mean is 13.5 and the standard deviation is 2.5. The result that EXCEL gives is 13.5. Using 
MINITAB, we find: 

Inverse Cumulative Distribution Function 

Normal with mean =13.5 and standard deviation = 2.5 
p(X< = x) x 
0.5 13.5 

Figure 7-10(a) illustrates this discussion. Referring to Fig. 7-10(6), we see that 75% of the area 
under the curve is to the left of Q 3 . The EXCEL function =NORMINV(0.75,13.5,2.5) gives Q 3 = 15.19. 




second quartile = 13.5 
(a) 
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Figure 7-10 (c) shows that 90% of the area is to the left of P 90 . The EXCEL function =NORMINV 
(0.90,13.5,2.5) gives P go = 16.70. 



7.22 Using Appendix II, find g 3 , in Problem 7.21. 
SOLUTION 

When software such as EXCEL or M1NITAB is not available you must work with the standard normal 
distribution since it is the only one available in tables. 



z=0 

Fig. 7-11 Standard normal curve. 



Using Appendix II in reverse, we see that the area from z = to 2 = 0.67 is 0.2486 and from z = 
to z = 0.68 is 0.2518 (see Fig. 7-11). The area from -infinity to z = 0.675 is approximately 0.75 since there 
is 0.5 from —infinity to and 0.25 from to z = 0.675. Therefore 0.675 is roughly the third quartile on the 
standard normal curve. Let Q 3 be the third quartile on the normal curve having mean= 13.5 and standard 
deviation = 2.5 [see Fig. 7-10 (b)]. Now when Q 3 is transformed to a z-value, we have 0.675 = (Q 3 - 13.5)/2.5. 
Solving this equation for Q 3 , we have Q 3 = 2. 5(0.675)+ 13.5 = 15.19, the same answer EXCEL gave in 
Problem 7.21. 



7.23 Washers are produced so that their inside diameter is normally distributed with mean 0.500 inches 
(in) and standard deviation equal to 0.005 in. The washers are considered defective if their 
inside diameter is less than 0.490 in or greater than 0.5 10 in. Find the percent defective using 
Appendix II and EXCEL. 

SOLUTION 



0.490 in standard units it 



0.005 
0.510-0.500 



From Appendix II, the area to the right of Z = 2.00 is 0.5 - 0.4772 or 0.0228. The area to the left of 
Z=-2.00 is 0.0228. The percent defective is (0.0228 + 0.0228) x 100 = 4.56%. When finding areas under 
a normal curve using Appendix II, we convert to the standard normal curve to find the answer. 
If EXCEL is used, the answer is = 2*NORMDIST(0.490, 0.500, 0.005, 1) which also gives 4.56%. 
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NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION 

7.24 Find the probability of getting between 3 and 6 heads inclusive in 10 tosses of a fair coin by using 
(a) the binomial distribution and (b) the normal approximation to the binomial distribution. 

SOLUTION 

w -k.^-c/)®'®'-^ -*.*-.)-(?)(!)'©'-£ 

Pr{between 3 and 6 heads inclusive} = 7^ + ^ + ^ + ^ = T^ = °- 7734 
(&) The EXCEL plot of the binomial distribution for N= 10 tosses of a fair coin is shown in Fig. 7-13. 

Note that even though the binomial distribution is discrete, it has the shape of the continuous 
normal distribution. When approximating the binomial probability at 3, 4, 5, and 6 heads by the area 
under the normal curve, find the normal curve area from X=2.5 to X=6.5. The 0.5 that you go on 
either side of X= 3 and X= 6 is called the continuity correction. The following are the steps to follow 
when approximating the binomial with the normal. Choose the normal curve with mean 
iVp= 10(0.5) = 5 and standard deviation = ^Npq = ^10(0.5X0.5) = 1.58. You are choosing the nor- 
mal curve with the same center and variation as the binomial distribution. Then find the area under the 
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0.3 



0.25 



0.2 
| 0.15 



0.1 



0.05 - # 4 

4 1 1 1 1 — - — t 

2 4 6 8 10 12 

Number of heads 

Fig. 7-13 EXCEL plot of the binomial distribution for N= 10 and p = 0.5. 



0.25 



0.20 



0.15 



0.10 



0.05 



0.00 



01 23456789 10 
Fig. 7-14 Normal approximation to 3, 4, 5, or 6 heads when a coin is tossed 10 times. 

curve from 2.5 to 6.5 as shown in Fig. 7-14. This is the normal approximation to the binomial 
distribution. 

The solution when using an EXCEL worksheet is given by =NORMDlST(6.5, 5, 1.58, 1) - NORMDIST 
(2.5, 5, 1.58, 1) which equals 0.7720. 

If the technique using Appendix II is applied, the normal values 6.5 and 2.5 are first converted to standard 
normal values. (2.5 in standard units is -1.58 and 6.5 in standard units is 0.95.) The area between -1.58 and 
0.95 from Appendix II is 0.7718. Whichever method is used the answer is very close to the binomial answer 
of 0.7734. 

7.25 A fair coin is tossed 500 times. Find the probability that the number of heads will not differ from 
250 by (a) more than 10 and (b) more than 30. 
SOLUTION 



^ = ^=(500)0=250 a= ^(500)00 = 11.18 
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We require the probability that the number of heads will lie between 240 and 260 or, considering the 
data to be continuous, between 239.5 and 260.5. Since 239.5 in standard units is (239.5 - 250)/ 
11.18 = -0.94, and 260.5 in standard units is 0.94, we have 

Required probability = (area under normal curve between z = -0.94 and z = 0.94) 

= (twice area between z = and z = 0.94) = 2(0.3264) = 0.6528 

We require the probability that the number of heads will lie between 220 and 280 or, considering the 
data to be continuous, between 219.5 and 280.5. Since 219.5 in standard units is (219.5 — 250)/ 
11.18 = -2.73, and 280.5 in standard units is 2.73, we have 

Required probability = (twice area under normal curve between z = and z = —2.73) 
= 2(0.4968) = 0.9936 

It follows that we can be very confident that the number of heads will not differ from that expected (250) 
by more than 30. Thus if it turned out that the actual number of heads was 280, we would strongly 
believe that the coin was not fair (i.e., was loaded). 

7.26 Suppose 75% of the age group 1 through 4 years regularly utilize seat belts. Find the probability 
that in a random stop of 100 automobiles containing 1 through 4 year olds, 70 or fewer are found 
to be wearing a seat belt. Find the solution using the binomial distribution as well as the normal 
approximation to the binomial distribution. Use MINITAB to find the solutions. 

SOLUTION 

The MINITAB output given below shows that the probability that 70 or fewer will be found to be 
wearing a seat belt is equal to 0.1495. 

MTB > cdf 70; 

SUBC> binomial 100 . 75 . 

Cumulative Distribution Function 

Binomial with n = 100 and p = . 750000 

x P( X <= x) 
70.00 0.1495 

The solution using the normal approximation to the binomial distribution is found as follows: The 
mean of the binomial distribution is fj, = Np = 100(0.75) = 75 and the standard deviation is 
a = ^/Npq = yT00(0. 75)(0.25) = 4.33. The MINITAB output given below shows the normal approxima- 
tion to equal 0.1493. The approximation is very close to the true value. 

MTB > cdf 70.5; 

SUBC> normal mean = 75 sd = 4 . 33 . 
Cumulative Distribution Function 

Normal with mean = 75 .0000 and standard deviation = 4. 33000 

x P(X^x) 
70.5000 0.1493 



THE POISSON DISTRIBUTION 

7.27 Ten percent of the tools produced in a certain manufacturing process turn out to be defective. 
Find the probability that in a sample of 10 tools chosen at random exactly 2 will be defective 
by using (a) the binomial distribution and (b) the Poisson approximation to the binomial 
distribution. 
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SOLUTION 

The probability of a defective tool is p = 0. 1 . 

(a) Pr{2 defective tools in 10} = (0.1) 2 (0.9) 8 = 0.1937 or 0.19 

(b) With X = Np = 10(0.1) = 1 and using e = 2.718, 

\ x e- x (l)V' e- 1 1 
Pr{2 defective tools in 10} = — ^- = 1 '— = — = — = 0.1839 or 0.18 

In general, the approximation is good if p < 0.1 and A = Np < 5. 

7.28 If the probability that an individual suffers a bad reaction from injection of a given serum is 0.001, 
determine the probability that out of 2000 individuals (a) exactly 3 and (b) more than 2 individ- 
uals will suffer a bad reaction. Use MINITAB and find the answers using both the Poisson and 
the binomial distributions. 

SOLUTION 

(a) The following MINITAB output gives first the binomial probability that exactly 3 suffer a bad reaction. 
Using A = Np = (2000) (0.001) = 2, the Poisson probability is shown following the binomial probabil- 
ity. The Poisson approximation is seen to be extremely close to the binomial probability. 

MTB >pdf 3; 

SUBC> binomial 2000.001. 
Probability Density Function 

Binomial with n = 2000 and p = . 001 
x P( X = x) 

3.0 0.1805 

MTB >pdf 3; 
SUBC> poisson 2 . 

Probability Density Function 

Poisson with mu = 2 
x P( X = x) 

3.00 0.1804 

(b) The probability that more than 2 individuals suffer a bad reaction is given by 1 — P(X < 2). The 
following MINITAB output gives the probability that X < 2 as 0.6767 using both the binomial and 
the Poisson distribution. The probability that more than 2 suffer a bad reaction is 1 - 0.6767 = 0.3233. 

MTB > cdf 2 ; 

SUBC> binomial 2000 .001. 
Cumulative Distribution Function 

Binomial with n = 2000 and p = . 001 
x P ( X^x) 

MTB > cdf 2 ; 
SUBC> poisson 2 . 

Cumulative Distribution Function 

Poisson with mu = 2 

x P( X^ x) 

2.00 0.6767 
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7.29 A Poisson distribution is given by 

Find (a)p(0), (b)p(l), (c)p(2), and (d) p(3). 
SOLUTION 

(a) p(0) = ^ J2 ^ = — { = e- 0J2 = 0.4868 using Appendix VIII 

(b) p{\) = ( 0J2 ^? = (0J2)e- OJ2 = (0.72)(0.4868) = 0.3505 

(c) p(2) = C 3 - 72 ^ 0,72 = (0-5 1 84)e °' 72 = (02592)(0 _ 4868) = q.1262 
Another method 

p{2) = p(l) = (0.36)(0.3505) = 0.1262 
(0 12) 3 e~ 0J2 72 

(d) p(3) = = —P(2) = (0.24)(0.1262) = 0.0303 

THE MULTINOMIAL DISTRIBUTION 

7.30 A box contains 5 red balls, 4 white balls, and 3 blue balls. A ball is selected at random from the 
box, its color is noted, and then the ball is replaced. Find the probability that out of 6 balls 
selected in this manner, 3 are red, 2 are white, and 1 is blue. 

SOLUTION 

Pr{red at any drawing} = ^, Pr{white at any drawing} = ^, and Pr{blue at any drawing} = ^; thus 
Pr { 3are red, 2are white, Hsblue}^^) 3 ^) 2 ^) 1 ^ 

FITTING OF DATA BY THEORETICAL DISTRIBUTIONS 

7.31 Fit a binomial distribution to the data of Problem 2.17. 
SOLUTION 

We have Pr{X heads in a toss of 5 pennies} = p(X) = (y)p x cf~ x , where p and q are the respective 
probabilities of a head and a tail on a single toss of a penny. By Problem 7.11(a), the mean number of heads 
is (i = Np = 5p. For the actual (or observed) frequency distribution, the mean number of heads is 

N : A' (38)(0) + (144)(1) + (342)(2) + (287)(3) + (164) (4) + (25)(5) = 2470 = 

J2f iooo iooo 

Equating the theoretical and actual means, 5p = 2.47, or p = 0.494. Thus the fitted binomial distribu- 
tion is given by p(X) = (j.)(0.494) x (0.506) 5 ^. 

Table 7.4 lists these probabilities as well as the expected (theoretical) and actual frequencies. The fit is 
seen to be fair. 
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Table 7.4 



Number of 




Expected 


Observed 


Heads (X) 


Pr{JT heads} 


Frequency 


Frequency 





0.0332 


33.2, or 33 


38 


1 


0.1619 


161.9, or 162 


144 


2 


0.3162 


316.2, or 316 


342 


3 


0.3087 


308.7, or 309 


287 


4 


0.1507 


150.7, or 151 


164 


5 


0.0294 


29.4, or 29 


25 



7.32 Use the Kolmogorov-Smirnov test in MINITAB to test the data in Table 7.5 for normality. 
The data represent the time spent on a cell phone per week for 30 college students. 




til 




Fig. 7-15 Kolmogorov-Smirnov test for normality: normal data, (a) Histogram reveals a set of data that is 
normally distributed; (b) Kolmogorov-Smirnov test of normality indicates normality />-value>0.150. 
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The histogram in Fig. 7-1 5(a) indicates that the data in this survey are normally distributed. The 
Kolmogorov-Smirnov test also indicates that the sample data came from a population that is normally 
distributed. Most statisticians recommend that if the p-value is less than 0.05, then reject normality. The 
p-value here is > 0.1 5. 

7.33 Use the Kolmogorov-Smirnov test in MINITAB to test the data in Table 7.6 for normality. The 
data represent the time spent on a cell phone per week for 30 college students. 



Table 7.6 



18 16 11 

17 12 13 

17 17 17 
16 18 17 
16 17 17 
16 18 15 

18 18 16 
16 16 17 
14 18 15 
11 18 10 



Histogram of Hours 




(a) 

Probability Plot of Hours 

Normal 




Mean 15.83 
StDev 2.291 
N 30 
KS 0.162 
P-Value 0.046 



10 12 14 16 18 20 22 



(b) 

Fig. 7-16 Kolmogorov-Smirnov test for normality: non-normal data />-value = 0.046. (a) Histogram reveals a set of 
data that is skewed to the left; (b) Kolmogorov-Smirnov test of normality indicates lack of normality. 
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Most statisticians recommend that if the p-value is less than 0.05, then reject normality. The p-value here is 
less than 0.05. 



7.34 Table 7.7 shows the number of days, /, in a 50-day period during which X automobile accidents 
occurred in a city. Fit a Poisson distribution to the data. 



Table 7.7 



Number of 


Number of 


Accidents (X) 


Days (J) 





21 


1 

2 


18 

7 


3 


3 


4 


1 




Total 50 



The mean number of accidents is 

x = Y,fX = (21)(0) + (18)0) + (7)(2) + (3)(3) + (1)(4) 45 
£ / 50 50 

Thus, according to the Poisson distribution, 

(0.90)* e-°- 90 



Pr{X accidents} = 

Table 7.8 lists the probabilities for 0, 1,2, 3, and 4 accidents as obtained from this Poisson distribution, 
as well as the expected or theoretical number of days during which X accidents take place (obtained by 
multiplying the respective probabilities by 50). For convenience of comparison, column 4 repeats the actual 
number of days from Table 7.7. 

Note that the fit of the Poisson distribution to the given data is good. 



Table 7.8 



Number of 




Expected 


Actual 


Accidents (X) 


Pr{X accidents} 


Number of Days 


Number of Days 





0.4066 


20.33, or 20 


21 


1 


0.3659 


18.30, or 18 


18 




0.1647 


8.24, or 8 


7 




0.0494 


2.47, or 2 


3 


4 


0.0111 


0.56, or 1 


1 



For a true Poisson distribution, the variance a 2 = A. Computing the variance of the given distribution 
gives 0.97. This compares favorably with the value 0.90 for A, and this can be taken as further evidence for 
the suitability of the Poisson distribution in approximating the sample data. 
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Supplementary Problems 

THE BINOMIAL DISTRIBUTION 

7.35 Evaluate (a) 7!, (b) 10!/(6!4!), (c) <§, (d) (' 8 '), and (e) (*). 

7.36 Expand (a) (q + p) 1 and (b) (q + p) 10 . 

7.37 Find the probability that in tossing a fair coin six times there will appear (a) 0. (b) 1, (c) 2, (d), 3, (e) 4, (/) 5, 
and (g) 6 heads, (ft) Use M1NITAB to build the distribution of X=the number of heads in 6 tosses of a coin. 

7.38 Find the probability of (a) 2 or more heads and (b) fewer than 4 heads in a single toss of 6 fair coins, (c) Use 
EXCEL to find the answers to (a) and (6). 

7.39 If X denotes the number of heads in a single toss of 4 fair coins, find (a) Pr{X = 3}, (b) Pr{X < 2}, 
(c) Pr{JT < 2}, and (rf) Pr{l < X < 3}. 

7.40 Out of 800 families with 5 children each, how many would you expect to have (a) 3 boys, (b) 5 girls, and 
(c) either 2 or 3 boys? Assume equal probabilities for boys and girls. 

7.41 Find the probability of getting a total of 1 1 (a) once and (b) twice in two tosses of a pair of fair dice. 

7.42 What is the probability of getting a 9 exactly once in 3 throws with a pair of dice? 

7.43 Find the probability of guessing correctly at least 6 of the 10 answers on a true-false examination. 

7.44 An insurance salesperson sells policies to 5 men, all of identical age and in good health. According to the 
actuarial tables, the probability that a man of this particular age will be alive 30 years hence is |. Find the 
probability that in 30 years (a) all 5 men, (b) at least 3 men, (c) only 2 men, and (d) at least 1 man will be alive, 
(<?) Use EXCEL to answer (a) through (d). 

7.45 Compute the (a) mean. (/>) standard dev iation, (c) moment coefficient of skewness, and (d) moment coeffi- 
cient of kurtosis for a binomial distribution in which p = 0.7 and N = 60. Interpret the results. 

7.46 Show that if a binomial distribution with N = 100 is symmetrical, its moment coefficient of kurtosis is 2.98. 

7.47 Evaluate (a) £ (X - pfp(X) and (b) £ (X - fi) 4 p{X) for the binomial distribution. 

7.48 Prove formulas (1) and (2) at the beginning of this chapter for the moment coefficients of skewness and 
kurtosis. 



THE NORMAL DISTRIBUTION 

7.49 On a statistics examination the mean was 78 and the standard deviation was 10. 

(a) Determine the standard scores of two students whose grades were 93 and 62, respectively. 

(b) Determine the grades of two students whose standard scores were -0.6 and 1.2, respectively. 

7.50 Find (a) the mean and (b) the standard deviation on an examination in which grades of 70 and 88 correspond 
to standard scores of -0.6 and 1.4, respectively. 

7.51 Find the area under the normal curve between (a) z = -1.20 and z = 2.40, (b) z = 1.23 and z = 1.87, and 

(c) z = -2.35 and z = -0.50, (d) Work parts (a) through (c) using EXCEL. 

7.52 Find the area under the normal curve (a) to the left of z = -1.78, (b) to the left of z = 0.56, (c) to the right of 
z = -1.45, (d) corresponding to z > 2.16, (e) corresponding to -0.80 < z < 1.53, and (/) to the left of 
z = -2.52 and to the right of z = 1.83, (g) Solve parts (a) through (f) using EXCEL. 
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7.53 If z is normally distributed with mean and variance 1, find (a) Pr{z > -1.64}, (b) Pr{-1.96 < z < 1.96}, 
and (c) Pr{|z| > 1}. 

7.54 Find the value of z such that (a) the area to the right of z is 0.2266, (b) the area to the left of z is 0.0314, 

(c) the area between -0.23 and z is 0.5722, (d) the area between 1.15 and z is 0.0730, and (e) the area between 
-z and z is 0.9000. 

7.55 Find z, if Pr{z > z,} = 0.84, where z is normally distributed with mean and variance 1. 

7.56 Find the ordinates of the normal curve at (a) z = 2.25, (b) z = -0.32, and (c) z = — 1.18 using Appendix I. 

(d) Find the answers to (a) through (c) using EXCEL. 

7.57 Adult males have normally distributed heights with mean equal to 70 in and standard deviation equal to 3 in. 
(a) What percent are shorter than 65 in? (b) What percent are taller than 72 in? (c) What percent are between 
68 and 73 in? 

7.58 The amount spent for goods online is normally distributed with mean equal to $125 and standard deviation 
equal to $25 for a certain age group, (a) What percent spend more than $175? (b) What percent spend 
between $100 and $150? (c) What percent spend less than $50? 

7.59 The mean grade on a final examination was 72 and the standard deviation was 9. The top 10% of the 
students are to receive A's. What is the minimum grade that a student must get in order to receive an A? 

7.60 If a set of measurements is normally distributed, what percentage of the measurements differ from the mean 
by (a) more than half the standard deviation and (b) less than three-quarters of the standard deviation? 

7.61 If X is the mean and s is the standard deviation of a set of normally distributed measurements, what 
percentage of the measurements are (a) within the range X±2s, (b) outside the range X± 1.2s, and 
(c) greater than X - 1.5.?? 

7.62 In Problem 7.61, find the constant a such that the percentage of the cases (a) within the range X ± as is 75% 
and (b) less than X - as is 22%. 



NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION 

7.63 Find the probability that 200 tosses of a coin will result in (a) between 80 and 120 heads inclusive, (b) less 
than 90 heads, (c) less than 85 or more than 115 heads, and (d) exactly 100 heads. 

7.64 Find the probability that on a true-false examination a student can guess correctly the answers to (a) 12 or 
more out of 20 questions and (b) 24 or more out of 40 questions. 

7.65 Ten percent of the bolts that a machine produces are defective. Find the probability that in a random sample 
of 400 bolts produced by this machine, (a) at most 30, (b) between 30 and 50, (c) between 35 and 45, and 
(d) 55 or more of the bolts will be defective. 

7.66 Find the probability of getting more than 25 sevens in 100 tosses of a pair of fair dice. 



THE POISSON DISTRIBUTION 

7.67 If 3% of the electric bulbs manufactured by a company are defective, find the probability that in a sample of 
100 bulbs (a) 0, (b) 1, (c) 2, (d) 3, (<?) 4, and (/) 5 bulbs will be defective. 

7.68 In Problem 7.67, find the probability that (a) more than 5, (b) between 1 and 3, and (c) less than or equal to 
2 bulbs will be defective. 
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7.69 A bag contains 1 red and 7 white marbles. A marble is drawn from the bag and its color is observed. Then 
the marble is put back into the bag and the contents are thoroughly mixed. Using (a) the binomial distribu- 
tion and (b) the Poisson approximation to the binomial distribution, find the probability that in 8 such 
drawings a red ball is selected exactly 3 times. 

7.70 According to the National Office of Vital Statistics of the U.S. Department of Health, Education, and 
Welfare, the average number of accidental drownings per year in the United States is 3.0 per 100,000 
population. Find the probability that in a city of population 200,000 there will be (a) 0, (b) 2, (c) 6, (d) 8, 
(e) between 4 and 8, and (/) fewer than 3 accidental drownings per year. 

7.71 Between the hours of 2 and 4 p.m. the average number of phone calls per minute coming into the switch- 
board of a company is 2.5. Find the probability that during one particular minute there will be (a) 0, (b) 1, 
(c) 2, (d) 3, (<?) 4 or fewer, and (J) more than 6 phone calls. 



THE MULTINOMIAL DISTRIBUTION 

7.72 A fair die is tossed six times. Find the probability (a) that one 1, two 2's, and three 3's turn up and (h) that 
each side turns up only once. 

7.73 A box contains a very large number of red, white, blue, and yellow marbles in the ratio 4:3:2: 1, respec- 
tively. Find the probability that in 10 drawings (a) 4 red, 3 white, 2 blue, and 1 yellow marble will be drawn 
and (b) 8 red and 2 yellow marbles will be drawn. 

7.74 Find the probability of not getting a 1, 2, or 3 in four tosses of a fair die. 

FITTING OF DATA BY THEORETICAL DISTRIBUTIONS 

7.75 Fit a binomial distribution to the data in Table 7.9. 



Table 7.9 



X 


12 3 4 


f 


30 62 46 10 2 



7.76 A survey of middle-school students and their number of hours of exercise per week was determined. 
Construct a histogram of the data using STATISTIX. Use the Shapiro-Wilk test of STATISTIX to deter- 
mine if the data were taken from a normal distribution. The data are shown in Table 7.10. 




7.77 Using the data in Table 7.5 of Problem 7.32, construct a histogram of the data using STATISTIX. Use the 
Shapiro-Wilk test of STATISTIX to determine if the data were taken from a normal distribution. 



7.78 The test scores in Table 7.11 follow a U distribution. This is just the opposite of a normal distribution. Using 
the data in Table 7.11, construct a histogram of the data using STATISTIX. Use the Shapiro-Wilk test of 
STATISTIX to determine if the data were taken from a normal distribution. 
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7.79 Use the test scores in Table 7.11, the Anderson-Darling test in MINITAB, and the Ryan-Joiner in 
MINITAB to test that the data came from a normal population. 

7.80 For 10 Prussian army corps units over a period of 20 years (1875 to 1894), Table 7.12 shows the number of 
deaths per army corps per year resulting from the kick of a horse. Fit a Poisson distribution to the data. 



Table 7.12 



X 


12 3 4 


f 


109 65 22 3 1 



Elementary 
Sampling Theory 



SAMPLING THEORY 

Sampling theory is a study of relationships existing between a population and samples drawn from 
the population. It is of great value in many connections. For example, it is useful in estimating unknown 
population quantities (such as population mean and variance), often called population parameters or 
briefly parameters, from a knowledge of corresponding sample quantities (such as sample mean and 
variance), often called sample statistics or briefly statistics. Estimation problems are considered in 
Chapter 9. 

Sampling theory is also useful in determining whether the observed differences between two samples 
are due to chance variation or whether they are really significant. Such questions arise, for example, in 
testing a new serum for use in treatment of a disease or in deciding whether one production process is 
better than another. Their answers involve the use of so-called tests of significance and hypotheses that are 
important in the theory of decisions. These are considered in Chapter 10. 

In general, a study of the inferences made concerning a population by using samples drawn from it, 
together with indications of the accuracy of such inferences by using probability theory, is called 
statistical inference. 



RANDOM SAMPLES AND RANDOM NUMBERS 

In order that the conclusions of sampling theory and statistical inference be valid, samples must be 
chosen so as to be representative of a population. A study of sampling methods and of the related 
problems that arise is called the design of the experiment. 

One way in which a representative sample may be obtained is by a process called random sampling. 
according to which each member of a population has an equal chance of being included in the 
sample. One technique for obtaining a random sample is to assign numbers to each member of 
the population, write these numbers on small pieces of paper, place them in an urn, and then draw 
numbers from the urn, being careful to mix thoroughly before each drawing. An alternative method 
is to use a table of random numbers (see Appendix IX) specially constructed for such purposes. 
See Problem 8.6. 
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SAMPLING WITH AND WITHOUT REPLACEMENT 

If we draw a number from an urn, we have the choice of replacing or not replacing the number into 
the urn before a second drawing. In the first case the number can come up again and again, whereas in 
the second it can only come up once. Sampling where each member of the population may be chosen 
more than once is called sampling with replacement, while if each member cannot be chosen more than 
once it is called sampling without replacement. 

Populations are either finite or infinite. If, for example, we draw 10 balls successively 
without replacement from an urn containing 100 balls, we are sampling from a finite population; 
while if we toss a coin 50 times and count the number of heads, we are sampling from an infinite 
population. 

A finite population in which sampling is with replacement can theoretically be considered infinite, 
since any number of samples can be drawn without exhausting the population. For many practical 
purposes, sampling from a finite population that is very large can be considered to be sampling from 
an infinite population. 



SAMPLING DISTRIBUTIONS 

Consider all possible samples of size N that can be drawn from a given population (either with or 
without replacement). For each sample, we can compute a statistic (such as the mean and the standard 
deviation) that will vary from sample to sample. In this manner we obtain a distribution of the statistic 
that is called its sampling distribution. 

If, for example, the particular statistic used is the sample mean, then the distribution is called 
the sampling distribution of means, or the sampling distribution of the mean. Similarly, we could have 
sampling distributions of standard deviations, variances, medians, proportions, etc. 

For each sampling distribution, we can compute the mean, standard deviation, etc. Thus we can 
speak of the mean and standard deviation of the sampling distribution of means, etc. 



SAMPLING DISTRIBUTION OF MEANS 

Suppose that all possible samples of size N are drawn without replacement from a finite 
population of size N p > N. If we denote the mean and standard deviation of the sampling 
distribution of means by p x an d °~ x ar *d tne population mean and standard deviation by p and a, 
respectively, then 

a lN p -N 

Uy = u and cr Y = —=A— 7 (7) 

pz p x y /Vp - 1 w 

If the population is infinite or if sampling is with replacement, the above results reduce to 

H2 = H and a x=^ (2) 

For large values of N (7V>30), the sampling distribution of means is approximately a normal 
distribution with mean p x and standard deviation a x , irrespective of the population (so long as the 
population mean and variance are finite and the population size is at least twice the sample size). This 
result for an infinite population is a special case of the central limit theorem of advanced probability 
theory, which shows that the accuracy of the approximation improves as N gets larger. This is sometimes 
indicated by saying that the sampling distribution is asymptotically normal. 
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In case the population is normally distributed, the sampling distribution of means is also normally 
distributed even for small values of N (i.e., N< 30). 



SAMPLING DISTRIBUTION OF PROPORTIONS 

Suppose that a population is infinite and that the probability of occurrence of an event (called its 
success) is p, while the probability of nonoccurrence of the event is q = 1 - p. For example, the popula- 
tion may be all possible tosses of a fair coin in which the probability of the event "heads" is p = \. 
Consider all possible samples of size N drawn from this population, and for each sample determine the 
proportion P of successes. In the case of the coin, P would be the proportion of heads turning up in N 
tosses. We thus obtain a sampling distribution of proportions whose mean \x P and standard deviation o> 
are given by 

which can be obtained from equations (2) by placing /j, = p and a = ^fpq. For large values of N (N>30), 
the sampling distribution is very closely normally distributed. Note that the population is binomially 
distributed. 

Equations (3) are also valid for a finite population in which sampling is with replacement. For finite 
populations in which sampling is without replacement, equations (J) are replaced by equations (7) with 
p, = p and cr = y/pq. 

Note that equations (J) are obtained most easily by dividing the mean and standard deviation (Np 
and x/Npq) of the binomial distribution by TV (see Chapter 7). 



SAMPLING DISTRIBUTIONS OF DIFFERENCES AND SUMS 

Suppose that we are given two populations. For each sample of size N { drawn from the first 
population, let us compute a statistic Su this yields a sampling distribution for the statistic S u whose 
mean and standard deviation we denote by p, sl and a sl , respectively. Similarly, for each sample of size 
N 2 drawn from the second population, let us compute a statistic S 2 ; this yields a sampling distribution 
for the statistic S 2 , whose mean and standard deviation are denoted by [i S2 and a S2 . From all possible 
combinations of these samples from the two populations we can obtain a distribution of the differences, 
Si - S 2 , which is called the sampling distribution of differences of the statistics. The mean and standard 
deviation of this sampling distribution, denoted respectively by Msi-S2 and o- S i_ s2 , are given by 

Msi-S2 = IJ-s\ - MS2 and o- S i_ S2 = \J a 2 sl + a 2 S2 (4) 

provided that the samples chosen do not in any way depend on each other (i.e., the samples are 
independent). 

If 5*1 , and S 2 are the sample means from the two populations — which means we denote by X x and 
X 2 , respectively — then the sampling distribution of the differences of means is given for infinite popula- 
tions with means and standard deviations {h\,cti) and (n 2 ,cr 2 ), respectively, by 



I /erf aj 

M\-xi = M\ ~ Mi = Mi - V-i and (?x\-x2 = y + a \i = \l ^ + ^ ^ 

using equations (2). The result also holds for finite populations if sampling is with replacement. Similar 
results can be obtained for finite populations in which sampling is without replacement by using 
equations (7). 



Table 8.1 Standard Error for Sampling Distributions 



Sampling 
Distribution 




Standard 
Error 


Special Remarks 


Means 




This is true for large or small samples. The 
sampling distribution of means is very nearly 
normal for JV>30 even when the population is non- 
normal. 

Hx = /i, the population mean, in all cases. 


Proportions 


Gp = V 


!p(i -p) _ [pq 

N V N 


The remarks made for means apply here as 
/ip = p in all cases. 


Standard 
deviations 


V) 

(2) a 


V 4^ 2 


For N > 100, the sampling distribution of s 
is very nearly normal. 

a, is given by (7) only if the population is 
normal (or approximately normal). If the popula- 
tion is nonnormal, (2) can be used. 

Note that (2) reduces to (7) when \i 2 = °" 2 an d 
m = 3cr 4 , which is true for normal populations. 

For N> 100, fi s = ct very nearly 


Medians 




1.2533(7 

/2iV~ y/Jf 


For 7V>30, the sampling distribution of the 
median is very nearly normal. The given result holds 
only if the population is normal (or approximately 
normal). 

Mmcd = H 


First and third 


1.3626a 

"fll-^-^r- 


The remarks made for medians apply here as well. 
Hqi and fi Q3 are very nearly equal to the first 

Note that a Q2 = (7 mcd 




1.7094a 




Deciles 


1.4288a 
1.3180(7 
1.2680(7 


The remarks made for medians apply here as 

well. 

Mz)i» A*£»2» - - - are ver y nearly equal to the first, 
second, . . . deciles of the population. 
Note that a D5 = (7 mcd . 


Semi-interquartile 
ranges a<2 


0.7867(7 


The remarks made for medians apply here as 

fiQ is very nearly equal to the population semi- 
interquartile range 


Variances 


(1) <V 




The remarks made for standard deviation apply 
here as well. Note that (2) yields (1) in the case that the 


(2) a# = \ 


A^-3 2 


population is normal 

Ms 2 = a 2 (N - l)/N, which is very nearly a 2 for 




JV 


large N. 


Coefficients 
of variation 


a v = —^= \/\+2v 2 
s/2N 


Here v = a/fi is the population coefficient of 
variation. The given result holds for normal (or 
nearly normal) populations and N> 100. 
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Corresponding results can be obtained for the sampling distributions of differences of proportions 
from two binomially distributed populations with parameters {p\,q\) and (p 2 ,q 2 ), respectively. In this 
case S x and S 2 correspond to the proportion of successes, P\ and P 2 , and equations (4) yield the results 

Vp\-f2 = fJ-pi ~ Hp2 =P\-Pi and op\-P2 = \J oj>i + op 2 = \J^f~^^ ^ 

If N u and N 2 are large (N U N 2 > 30), the sampling distributions of differences of means or proportions 
are very closely normally distributed. 

It is sometimes useful to speak of the sampling distribution of the sum of statistics. The mean and 
standard deviation of this distribution are given by 



Msi+S2 = /Asi + MS2 and <r S i+s2 = ^ a 2 sl + a 2 S2 (7) 
assuming that the samples are independent. 
STANDARD ERRORS 

The standard deviation of a sampling distribution of a statistic is often called its standard error. 
Table 8.1 lists standard errors of sampling distributions for various statistics under the conditions of 
random sampling from an infinite (or very large) population or of sampling with replacement from a 
finite population. Also listed are special remarks giving conditions under which results are valid and 
other pertinent statements. 

The quantities p, a, p, p r and X, s, P, m r denote, respectively, the population and sample means, 
standard deviations, proportions, and rth moments about the mean. 

It is noted that if the sample size N is large enough, the sampling distributions are normal or nearly 
normal. For this reason, the methods are known as large sampling methods. When TV < 30, samples are 
called small. The theory of small samples, or exact sampling theory as it is sometimes called, is treated 
in Chapter 11. 

When population parameters such as a, p, or p r , are unknown, they may be estimated closely by 
their corresponding sample statistics namely, ,v (or s = ^/N/(N - \)s), P, and m r — if the samples are 
large enough. 



SOFTWARE DEMONSTRATION OF ELEMENTARY SAMPLING THEORY 
EXAMPLE 1. 

A large population has the following random variable defined on it. X represents the number of 
computers per household and X is uniformly distributed, that is, p(x) = 0.25 for x= 1, 2, 3, and 4. In 
other words, 25% of the households have 1 computer, 25% have 2 computers, 25% have 3 computers, 
and 25% have 4 computers. The mean value of X is p = Y>xp(x) = 0.25 + 0.5 + 0.75 + 1 — 2.5. 
The variance of X is o 2 = Y,x 2 p(x) - p 2 = 0.25 + 1 + 2.25 + 4 - 6.25 = 1.25. We say that the mean 
number of computers per household is 2.5 and the variance of the number of computers is 1.25 per 
household. 

EXAMPLE 2. 

MINITAB may be used to list all samples of two households taken with replacement. The worksheet 
would be as in Table 8.2. The 16 samples are shown in CI and C2 and the mean for each sample in C3. 
Because the population is uniformly distributed, each sample mean has probability 1/16. Summarizing, 
the probability distribution is given in C4 and C5. 

Note that p^ = Y,xp{x) = 1(0.0625) + 1.5(0.1250) + • • • + 4(0.0625) = 2.5. We see that p, x = p,. 
Also, a\ = Y,x 2 p{x) - p 2 = 1(0.0625) + 2.25(0.1250) + • • • + 16(0.0625) - 6.25 = 0.625 which gives 
a\ = (ct 2 /2). If MINITAB is used to draw the graph of the probability distribution of xbar, the result 
shown in Fig. 8-1 is obtained. (Note that X and xbar are used interchangeably.) 
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Table 8.2 



CI 


C2 


C3 


C4 


C5 


household 1 


household2 


mean 


xbar 


p(xbar) 


1 




1.0 


1.0 


0.0625 


1 


2 


1.5 


1.5 


0.1250 


1 


3 


2.0 


2.0 


0.1875 


1 


4 


2.5 


2.5 


0.2500 


2 


1 


1.5 


3.0 


0.1875 


2 




2.0 


3.5 


0.1250 


2 


3 


2.5 


4.0 


0.0625 


2 


4 


3.0 






3 


1 


2.0 






3 


2 


2.5 






3 


3 


3.0 






3 


4 


3.5 






4 


1 


2.5 






4 




3.0 






4 


3 


3.5 






4 


4 


4.0 







0.25 



0.20 
I 0.15 
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Fig. 8-1 Scatterplot of p(xbar) vs xbar. 



Solved Problems 



SAMPLING DISTRIBUTION OF MEANS 

8.1 A population consists of the five numbers 2, 3, 6, 8, and 11. Consider all possible samples of size 2 
that can be drawn with replacement from this population. Find (a) the mean of the population, 
(b) the standard deviation of the population, (c ) the mean of the sampling distribution of means, 
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and (d) the standard deviation of the sampling distribution of means (i.e., the standard error of 
means). 

SOLUTION 



2 + 3 + 6 + 8+11 



^ _ (2 - 6) 2 + (3 - 6) 2 + (6 - 6) 2 + (8 - 6) 2 + (1 1 - 6) 2 _ 16 + 9 + + 4 + 25 _ ^ 



(c) There are 5(5) = 25 samples of size 2 that can be drawn with replacement (since any one of the five 
numbers on the first draw can be associated with any one of the five numbers on the second draw). 
These are 



(2 


2) 


(2, 3) 


(2. 


6) 


(2, 8) 


(2, 


11) 


(3 


2) 


(3, 3) 


(3, 


6) 


(3, 8) 


(3, 


11) 


(6 


2) 


(6, 3) 


(6, 


6) 


(6, 8) 


(6, 


11) 


(8 


2) 


(8, 3) 


(8, 


6) 


(8, 8) 


(8, 


11) 


11 


2) 


(11,3) 


(11, 


6) 


(11,8) 


(11. 


11) 



The corresponding sample means are 



and the mean of sampling distribution of means is 

sum of all sample means in (8) 
Hx- ~ 



= ^ = 6.0 



illustrating the fact that fi x = fi. 

(d) The variance a x of the sampling distribution of means is obtained by subtracting the mean 6 from each 
number in (§), squaring the result, adding all 25 numbers thus obtained, and dividing by 25. The final 
result is o\ = 135/25 = 5.40, and thus a x = V^40 = 2.32. This illustrates the fact that for finite 
populations involving sampling with replacement (or infinite populations), a\ = a 2 /N since the 
right-hand side is 10.8/2 = 5.40, agreeing with the above value. 



8.2 Solve Problem 8.1 for the case that the sampling is without replacement. 
SOLUTION 

As in parts (a) and (b) of Problem 8.1, fj, = 6 and a = 3.29. 

(c) There are ( 5 2 ) = 10 samples of size 2 that can be drawn without replacement (this means that we draw 
one number and then another number different from the first) from the population: (2, 3), (2, 6), (2, 8), 
(2, 1 1), (3, 6), (3, 8), (3, 1 1), (6, 8), (6, 1 1), and (8, 1 1). The selection (2, 3), for example, is considered the 
same as (3, 2). 

The corresponding sample means are 2.5, 4.0, 5.0, 6.5, 4.5, 5.5, 7.0, 7.0, 8.5, and 9.5, and the mean 
of sampling distribution of means is 

_ 2.5 + 4.0 + 5.0 + 6.5 + 4.5 + 5.5 + 7.0 + 7.0 + 8.5 + 9.5 
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illustrating the fact that fi x = fj,. 
(d) The variance of sampling disiribulion of means is 



_ (2.5 - 6.0)' + (4.0 - 6.0) 2 + (5.0 - 6.0)' + ■ ■ ■ + (9.5 - 6.0) 2 
10 



and a x — 2.01. This illustrates 



since the right side equals 



s obtained above. 



8.3 Assume that the heights of 3000 male students at a university are normally distributed with mean 
68.0 inches (in) and standard deviation 3.0 in. If 80 samples consisting of 25 students each are 
obtained, what would be the expected mean and standard deviation of the resulting sampling 
distribution of means if the sampling were done (a) with replacement and (b) without replacement? 



SOLUTION 

The numbers of samples of size 25 thai could be obtained theoretically from a group of 3000 students w ith 
and without replacement are (3000) 25 and ( 3 25°), which are much larger than 80. Hence we do not get a true 
sampling distribution of means, but only an experimental sampling distribution. Nevertheless, since the number 
of samples is large, there should be close agreement between the two sampling distributions. Hence the 
expected mean and standard deviation would be close to those of the theoretical distribution. Thus we have: 

(a) w = ^t = 68.0in and a x = = -^L = 0.6in 

a lN„-N 3 /3000 - 25 
(h) W = 68.0m and a , = —^^— J = —^^^ 

which is only very slightly less than 0.6 in and can therefore, for all practical purposes, be considered the 
same as in sampling with replacement. 

Thus we would expect the experimental sampling distribution of means to be approximately 
normally distributed with mean 68.0 in and standard deviation 0.6 in. 



8.4 In how many samples of Problem 8.3 would you expect to find the mean (a) between 66.8 and 68.3 
in and (b) less than 66.4 in? 

SOLUTION 

The mean X of a sample in standard units is here given by 
_ X - _ X - 68.0 
Z ~ a x ~ 0.6 
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Fig. 8-2 Areas under the standard normal curve, (a) Standard normal curve showing the area between z = — 2 and 
z = 0.5; (b) Standard normal curve showing the area to the left of z = -2.67. 



As shown in Fig. 8-2(a), 

Proportion of samples with means between 66.8 and 68.3 in 

= (area under normal curve between z = -2.0 and z = 0.5) 

= (area between z = — 2 and z — 0) + (area between z = and z = 0.5) 

= 0.4772 + 0.1915 = 0.6687 

Thus the expected number of samples is (80)(0.6687) = 53.496, or 53. 

(b) 66.4 in standard units = 66 ' 4 Q = -2.67 

As shown in Fig. 8.2(6), 

Proportion of samples with means less than 66.4 in = (area under normal curve to left of z = —2.67) 
= (area to left of z = 0) 

-(area between z = -2.67 and z = 0) 
= 0.5 - 0.4962 = 0.0038 

Thus the expected number of samples is (80)(0.0038) = 0.304, or zero. 

8.5 Five hundred ball bearings have a mean weight of 5.02 grams (g) and a standard 
deviation of 0.30 g. Find the probability that a random sample of 100 ball bearings chosen 
from this group will have a combined weight of (a) between 496 and 500 g and (b) more than 
510 g. 
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SOLUTION 

For the sample distribution of rr 



>, Hx = fJ. = 5.02 g, and 



a N p — N 0.30 /500- 100 

in weight of the 100 ball bearings lies 

4.96-5.02 



(a) The combined weight will lie between 496 and 500 g if the 
between 4.96 and 5.00 g. 



4.96 in standard u 
5.00 in standard u 



0.0027 
5.00-5.02 



2.22 
= -0.74 



As shown in Fig. 8-3(a), 



Required probability = (area between z = -2.22 and - = -0.74) 

= (area between z = -2.22 and z = 0) - (area between z = -0.74 and z = 0) 
= 0.4868 - 0.2704 = 0.2164 





Fig. 8-3 Sample probabilities are found as areas under the standard normal curve. (</) Standard normal curve 
showing the area between z = -2.22 and z = -0.74; (b) Standard normal curve showing the area to 
the right of z = 2.96. 

(b) The combined weight will exceed 510 g if the mean weight of the 100 bearings exceeds 5.10g. 

. , , - 5.10-5.02 „ njr 
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As shown in Fig. 8-3(6), 

Required probability = (area to right of z = 2.96) 

= (area to right of z = 0) - (area between z = and z = 2.96) 
= 0.5 - 0.4985 = 0.0015 

Thus there are only 3 chances in 2000 of picking a sample of 100 ball bearings with a combined weight 
exceeding 510 g. 

8.6 (a) Show how to select 30 random samples of 4 students each (with replacement) from Table 2. 1 
by using random numbers. 

(b) Find the mean and standard deviation of the sampling distribution of means in part (a). 

(c) Compare the results of part (b) with theoretical values, explaining any discrepancies. 

SOLUTION 

(a) Use two digits to number each of the 100 students: 00, 01, 02, ... ,99 (see Table 8.3). Thus the 5 students 
with heights 60-62 in are numbered 00-04, the 18 students with heights 63-65 in are numbered 05-22, 
etc. Each student number is called a sampling number. 



Height (in) 


Frequency 


Sampling Number 


60-62 


5 


00-04 


63-65 


18 


05-22 


66-68 


42 


23-64 


69-71 


27 


65-91 


72-74 


8 


92-99 



We now draw sampling numbers from the random-number table (Appendix IX). From the first 
line we find the sequence 51, 77, 27, 46, 40, etc., which we take as random sampling numbers, each of 
which yields the height of a particular student. Thus 51 corresponds to a student having height 66-68 in, 
which we take as 67 in (the class mark). Similarly, 77, 27, and 46 yield heights of 70, 67, and 67 in, 
respectively. 

By this process we obtain Table 8.4, which shows the sample numbers drawn, the corresponding 
heights, and the mean height for each of 30 samples. It should be mentioned that although we have 
entered the random-number table on the first line, we could have started anywhere and chosen any 
specified pattern. 

(b) Table 8.5 gives the frequency distribution of the sample mean heights obtained in part (a). This is a 
sampling distribution of means. The mean and the standard deviation are obtained as usual by the 
coding methods of Chapters 3 and 4: 

Mpan = 4 + rB=4 ^JZfit = t y m + (0.75) (23) = ^ ^ . 



Standard deviation = 



N 



N 



The theoretical mean ol the sampling distribution of moans, given by fig, should equal the population 
mean fi, which is 67.45 in (see Problem 3.22), in agreement with the value 67.58 in of part (b). 

The theoretical standard deviation (standard error) of the sampling distribution of means, given by 
o x , should equal ct/VTY, where the population standard deviation a = 2.92 in (see Problem 4.17) and 
the sample size N = 4. Since a/\fN = 2.92/\/4 = 1.46 in, we have agreement with the value 1.41 in of 
part (b). The discrepancies result from the fact that only 30 samples were selected and the sample size 
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Table 8.4 



Sample Number 


Corresponding 
Height 


Mean 
Height 


Sample Number 


Corresponding 
Height 


Mean 
Height 


1. 


51 


77 


27 


46 


67, 70, 67, 


67 


67.75 


16. 


11 


64 


55 


58 


64, 


67 


67 


67 


66.25 


2. 


40 


42 


33 


12 


67, 67, 67, 


64 


66.25 


17. 


70 


56 


97 


43 


70, 


67 


73, 


67 


69.25 


3. 


90 


44 


46 


62 


70, 67, 67, 


67 


67.75 


18. 


74 


28 


93 


50 


70, 


67 


73, 


67 


69.25 


4. 


16 


28 


98 


93 


64, 67, 73, 


73 


69.25 


19. 


79 


42 


71 


30 


70, 


67 


70 


67 


68.50 


5. 


58 


20 


41 


86 


67, 64, 67, 


70 


67.00 


20. 


58 


60 


21 


33 


67, 


67 


64 


67 


66.25 


6. 


19 


64 


08 


70 


64, 67, 64, 


70 


66.25 


21. 


75 


79 


74 


54 


70, 


70 


70 


67 


69.25 


7. 


56 


24 


03 


32 


67, 67, 61, 


67 


65.50 


22. 


06 


31 


04 


18 


64, 


67 


61 


64 


64.00 


8. 


34 


91 


83 


58 


67, 70, 70, 


67 


68.50 


23. 


67 


07 


12 


97 


70, 


64 


64 


73 


67.75 


9. 


70 


65 


68 


21 


70, 70, 70, 


64 




24. 


31 


71 


69 


88 


67, 


70 


70 


70 




10. 


96 


02 


13, 


87 


73, 61, 64, 


70 


67.00 


25. 


11 


64 


21 


87 


64, 


67 


64 


70 


66.25 


11. 


76 


10 


51 


08 


70, 64, 67, 


64 


66.25 


26. 


03 


58 


57, 


93 


61, 


67 


67 


73 


67.00 


12. 


63 


97 


45 


39 


67, 73, 67, 


67 


68.50 


27. 


53 


81 


93 


88 


67, 


70 


73, 


70 


70.00 


13. 


05 


81 


45 


93 


64, 70, 67, 


73 


68.50 


28. 


23 


22 


96 


79 


67, 


64 


73, 


70 


68.50 


14. 


96 


01 


73, 


52 


73, 61, 70, 


67 


67.75 


29. 


98 


56 


59 


36 


73, 


67 


67 


67 


68.50 


15. 


07 


82 


54 


24 


64, 70, 67, 


67 


67.00 


30. 


08 


15 


08 


84 


64, 


64 


64 


70 


65.50 



Table 8.5 



Sample Mean 


Tally 


/ 




fu 


fu 1 


64.00 




i 


-4 


-4 


16 


64.75 







-3 








65.50 


// 




-2 


-4 


8 


66.25 


W\ 


6 


-1 


-6 


6 


A -> 67.00 


mi 


4 











67.75 


mi 


4 


1 


4 


4 


68.50 




7 


2 


14 


28 


69.25 




5 


3 


15 


45 


70.00 


i 


1 


4 


4 


16 




£/ = iV=30 




E/« = 23 


£/V= 123 



SAMPLING DISTRIBUTION OF PROPORTIONS 

8.7 Find the probability that in 120 tosses of a fair coin (a) less than 40% or more than 60% will be 
heads and (b) | or more will be heads. 

SOLUTION 

First method 

We consider the 120 tosses of the coin to be a sample from the infinite population of all possible tosses 
of the coin. In this population the probability of heads is p = \ and the probability of tails is q = 1 - p = \. 
(a) We require the probability that the number of heads in 120 tosses will be less than 48 or more than 72. 

We proceed as in Chapter 7, using the normal approximation to the binomial. Since the number of 



CHAP. 8] 



ELEMENTARY SAMPLING THEORY 



215 



heads is a discrete variable, we ask for the probability that the number of heads is less than 47.5 o 
greater than 72.5. 



/i = expected number of heads = Np = 1200 = 60 and a = v r Npq = y/(120)(±)(±) = 5.4 



47.5 in standard units = 47,5 „ n 60 = -2.28 
5.48 



72.5 in standard units = ^ — = 2.28 

As shown in Fig. 8-4, 

Required probability = (area to the left of -2.28 plus area to the right of 2.28) 
= (2(0.0113) = 0.0226) 




-2.28 2.28 
Fig. 8-4 Normal approximation to the binomial uses the standard normal c 



Second method 



40% in standard u 



60% in standard u 



0.0456 
0.60-0.50 



Required probability = (area to the left of -2.19 plus area to the right of 2.19) 
= (2(0.0143) = 0.0286) 

Although this result is accurate to two significant figures, it does not agree exactly since we have not 
used the fact that the proportion is actually a discrete variable. To account for this, we subtract 
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\/2N= 1/2(120) from 0.40 and add \/2N = 1/2(120) to 0.60; thus, since 1/240 = 0.00417, the 
required proportions in standard units are 

0.40-0.00417- 0.50 „ „„ 0.60 + 0.00417 -0.50 „ „„ 

01)456 =- 2 - 28 and 01)456 = " 8 

so that agreement with the first method is obtained. 

Note that (0.40-0.00417) and (0.60 + 0.00417) correspond to the proportions 47.5/120 and 
72.5/120 in the first method. 

(b) Using the second method of part (a), we find that since § = 0.6250, 

(0.6250 - 0.00417) in standard units = = 2.65 

Required probability = (area under normal curve to right of z = 2.65) 

= (area to right of z = 0) - (area between z = and z = 2.65) 
= 0.5 - 0.4960 = 0.0040 



8.8 Each person of a group of 500 people tosses a fair coin 120 times. How many people should be 
expected to report that (a) between 40% and 60% of their tosses resulted in heads and (b) § or 
more of their tosses resulted in heads? 

SOLUTION 

This problem is closely related to Problem 8.7. Here we consider 500 samples, of size 120 each, from the 
infinite population of all possible tosses of a coin. 

(a) Part (a) of Problem 8.7 states that of all possible samples, each consisting of 120 tosses of a coin, we can 
expect to find 97.74% with a percentage of heads between 40% and 60%. In 500 samples we can thus 
expect to find about (97.74% of 500) = 489 samples with this property. It follows that about 489 people 
would be expected to report that their experiment resulted in between 40% and 60% heads. 

It is interesting to note that 500 - 489 = 1 1 people who would be expected to report that the 
percentage of heads was not between 40% and 60%. Such people might reasonably conclude that 
their coins were loaded even though they were fair. This type of error is an everpresent risk whenever 
we deal with probability. 

(b) By reasoning as in part (a), we conclude that about (500) (0.0040) = 2 persons would report that | or 
more of their tosses resulted in heads. 



It has been found that 2% of the tools produced by a certain machine are defective. What is the 
probability that in a shipment of 400 such tools (a) 3% or more and (b) 2% or less will prove 
defective? 



(a) First method 

Using the correction for discrete variables, 1/2N = 1/800 = 0.00125, we have 
(0.03 - 0.00125) in standard units = 0-03-0^00125 - 0.02 = ^ 



Required probability = (area under normal curve to right of z = 1.25) = 0.1056 
If we had not used the correction, we would have obtained 0.0764. 
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Another method 

(3% of 400) = 12 defective tools. On a continuous basis 12 or more tools means 11.5 or more. 

X = (2% of 400) = 8 and a = ^Nfq = ^(400) (0.02) (0.98) = 2.8 
Then, 11.5 in standard units = (11.5 — 8)/2.8 = 1.25, and as before the required probability is 0.1056. 

0.007 " = - 18 

Required probability = (area under normal curve to left of z = 0.18) 

= 0.5000 + 0.0714 = 0.5714 

If we had not used the correction, we would have obtained 0.5000. The second method of part (a) can 
also be used. 

3.10 The election returns showed that a certain candidate received 46% of the votes. Determine the 
probability that a poll of (a) 200 and (b) 1000 people selected at random from the voting 
population would have shown a majority of votes in favor of the candidate. 

SOLUTION 

(a) ,,=, = 0.46 and „ = ^ = = 0.0352 

Since 1 /27V = 1/400 = 0.0025, a majority is indicated in the sample if the proportion in favor of the 
candidate is 0.50 + 0.0025 = 0.5025 or more. (This proportion can also be obtained by realizing that 
101 or more indicates a majority, but as a continuous variable this is 100.5, and so the proportion is 
100.5/200 = 0.5025.) 

0.5025 in standard units = °^^ 6 = 1.21 

Required probability = (area under normal curve to right of z = 1.21) 
= 0.5000-0.3869 = 0.1131 



\m /(o 



1000 
0.5025 - 0.46 



0.5025 in standard units = Q0158 ' = 2 - 69 

Required probability = (area under normal curve to right of z = 2.6 
= 0.5000 - 0.4964 = 0.0036 



SAMPLING DISTRIBUTIONS OF DIFFERENCES AND SUMS 

8.11 Let U\ be a variable that stands for any of the elements of the population 3, 7, 8 and U 2 be 
a variable that stands for any of the elements of the population 2, 4. Compute (a) n m , (b) fi u2 , 
(c) nui-ui, (4) a uu (e) a U2 , and (J) a m _ u2 - 



(a) n m = mean of population U\ = 5 (3 + 7 + 8) = 6 

(b) = mean of population U 2 = \ (2 + 4) = 3 
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(c) The population consisting of the differences of any member of U\ and any member of U 2 is 

3-2 7-2 8-2 or 1 5 6 
3 -4 7-4 8 - 4 - 1 3 4 

Thus Hm-ui = mean of (tf, - Efc) = * + ^ + 6 + (-1) + 3 + 4 = 3 

This illustrates the general result fj.u\-u2 = Mc/i - ^(72, as seen from parts (a) and (6). 

2 • f i tj (3-6) 2 + (7-6) 2 + (8-6) 2 14 
(a) g ux = variance of population {/, = — ^ — = — 

/14 

or ^i = yy 

2 ■ , , t - TJ (2 - 3) 2 + (4 - 3) 2 
(e) = variance of population f/ 2 = = 1 or cr^ = 1 

(/) ° 2 ui-u2 = variance of population ({7, - U 2 ) 

_ (1 - 3) 2 + (5 - 3) 2 + (6 - 3) 2 + (-1 - 3) 2 + (3 - 3) 2 + (4 - 3) 2 _ 17 



This illustrates the general result for independent samples, uu\-u2 = y a u\ + °U2> as seen ^ Iom parts 
(d)and(e). 

8.12 The electric light bulbs of manufacturer A have a mean lifetime of 1400 hours (h) with a standard 
deviation of 200 h, while those of manufacturer B have a mean lifetime of 1200 h with a standard 
deviation of 100 h. If random samples of 125 bulbs of each brand are tested, what is the prob- 
ability that the brand A bulbs will have a mean lifetime that is at least (a) 160 h and (b) 250 h more 
than the brand B bulbs? 

SOLUTION 

Let X A and X B denote the mean lifetimes of samples A and B, respectively. Then 
Hx A -x B = M A ~ M B = 1400 - 1200 = 200 h 
frt 4 /(100) 2 (200) 2 „„, 

and ^ = ^t + t = f-nr + -nr = 20h 

The standardized variable for the difference in means is 

_ _ (X A - X B ) - ( Ma - Xb ) _ (X A - X B ) - 200 
a x A -x B 20 
and is very closely normally distributed. 

(a) The difference 160h in standard units is (160 - 200)/20 = -2. Thus 

Required probability = (area under normal curve to right of z = -2) 
= 0.5000 + 0.4772 = 0.9772 

(b) The difference 250 h in standard units is (250 - 200)/20 = 2.50. Thus 

Required probability = (area under normal curve to right of z = 2.50) 
= 0.5000 - 0.4938 = 0.0062 
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8.13 Ball bearings of a given brand weigh 0.50 g with a standard deviation of 0.02 g. What is the 
probability that two lots of 1000 ball bearings each will differ in weight by more than 2g? 

SOLUTION 

Let Xi and X 2 denote the mean weights of ball bearings in the two lots. Then 
M,-x 2 = M, - M 2 = 0-50 - 0.50 = 

and a ^ = ^N~ ] + N~ 2 '' 

The standardized variable for the difference ii 

_ (Xi - X 2 ) - 
2 ~ 0.000895 

and is very closely normally distributed. 

A difference of 2 g in the lots is equivalent to a difference of 2/1000 = 0.002 g in the means. This can 
occur either if X x - X 2 > 0.002 or X, - X 2 < -0.002; that is, 

z> 0.002- = 2 23 or -0.002 -Q 

- 0.000895 - 0.000895 

Then Pr{z > 2.23 or z < -2.23} = Pr{z > 2.23} + Pr{z < -2.23} = 2(0.5000 - 0.4871) = 0.0258. 

8.14 A and B play a game of "heads and tails," each tossing 50 coins. A will win the game if she tosses 5 or 
more heads than B; otherwise, B wins. Determine the odds against A winning any particular game. 

SOLUTION 

Let P A and P B denote the proportion of heads obtained by A and B. If we assume that the coins are 
all fair, the probability p of heads is \. Then 

VPa-Pb = »Pa ~ U Pb = 



'a-Pb = \J a P A + °Pb 



a p, 

The standardized variable for the difference in proportions is z = (P A — P B — 0)/0.10. 

On a continuous-variable basis, 5 or more heads means 4.5 or more heads, so that the difference in 
proportions should be 4.5/50 = 0.09 or more; that is, z is greater than or equal to (0.09 - 0)/0.10 = 0.9 (or 
z > 0.9). The probability of this is the area under the normal curve to the right of z = 0.9, which is 
(0.5000 - 0.3159) = 0.1841. 

Thus the odds against A winning are (1 - 0.1841) :0. 1841 = 0.8159:0.1841, or 4.43 to 1. 

8.15 Two distances are measured as 27.3 centimeters (cm) and 15.6 cm with standard deviations 
(standard errors) of 0.16 cm and 0.08 cm, respectively. Determine the mean and standard devia- 
tion of (a) the sum and (b) the difference of the distances. 

SOLUTION 

If the distances are denoted by D x and D 2 , then: 

(a) Mfli+ZH = I*di + V-di = 27.3 + 15.6 = 42.9 cm 

°Di+D2 = \l°m+°m = ^(0.16) 2 + (0.08) 2 = 0.18 cm 

(b) fJ-D\-D2 = Mdi ~ = 27.3 - 15.6 = 11.7 cm 

°d\-di = \j°m + a],! = \/(0.16) 2 + (0.08) 2 = 0.18cm 
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3.16 A certain type of electric light bulb has a mean lifetime of 1500h and a standard deviation of 
150h. Three bulbs are connected so that when one burns out, another will go on. Assuming 
that the lifetimes are normally distributed, what is the probability that lighting will take place for 
(a) at least 5000 h and (b) at most 4200 h? 

SOLUTION 

Assume the lifetimes to be L u L 2 , and L 3 . Then 

Vli+li+li = Mli + ML2 + to = 1500 + 1500 + 1500 = 4500 h 

VU+L2+U = y/oLl+oh + O 2 !* = ^3( 150 ) 2 = 260 h 

(a) SOOOOhm standard un lt s = 500 ° 260 450Q = 1.92 

Required probability = (area under normal curve to right of z = 1.92) 
= 0.5000 - 0.4726 = 0.0274 



Required probability = (area under normal curve to left of z = - 1 . 1 5) 
= 0.5000- 0.3749 = 0.1251 



SOFTWARE DEMONSTRATION OF ELEMENTARY SAMPLING THEORY 



8.17 Midwestern University has 1/3 of its students taking 9 credit hours, 1/3 taking 12 credit hours, 
and 1/3 taking 15 credit hours. If X represents the credit hours a student is taking, the distribution 
of Xis p{x)= 1/3 for x = 9, 12, and 15. Find the mean and variance of X. What type of distribu- 
tion does X have? 

SOLUTION 

The mean of X is /x = £ X p(x) = 9(1/3) + 12(1/3) + 15(1/3) = 12. The variance of X is 
a 2 = x 2 p(x) -n 2 = 81(1/3) + 144(1/3) + 225(1/3) - 144 = 150- 144 = 6. The distribution of X is 
uniform. 

8.18 List all samples of size n = 2 that are possible (with replacement) from the population in Problem 
8.17. Use the chart wizard of EXCEL to plot the sampling distribution of the mean to show that 
fa = [i, and show that a\ = a 2 /2. 

SOLUTION 



A 


B 


C 


D 


E 


F 


G 


9 


9 


mean 
9 


xbar 
9 


p(xbar) 
0.111111 


xbar x p(xbar) 
1 


xbar 2 x p(xbar) 
9 


9 


12 


10.5 


10.5 


0.222222 


2.333333333 


24.5 


9 


15 


12 


12 


0.333333 


4 


48 


12 


9 


10.5 


13.5 


0.222222 


3 


40.5 


12 


12 


12 


15 


0.111111 


1 .666666667 


25 


12 


15 


13.5 






12 


147 


15 


9 


12 










15 


12 


13.5 










15 


15 


15 
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The EXCEL worksheet shows the possible sample values in A and B and the mean in C. The 
sampling distribution of xbar is built and displayed in D and E. In C2, the function 
=AVERAGE(A2:B2) is entered and a click-and-drag is performed from C2 to CIO. Because the 
population is uniform each sample has probability 1/9 of being selected. The sample mean is 
represented by xbar. The mean of the sample mean is = S.vp(x)and is computed in F2 through 
F6. The function =SUM(F2:F6) is in F7 and is equal to 12 showing that ji s = \i. The variance of the 
sample mean is o\ = ~Ex 2 p(x) — fi 2 and is computed as follows. Y,x 2 p(x) is computed in G2 through G6. 
The function =SUM(G2:G6) is in G7 and is equal to 147. When 12 2 or 144 is subtracted from 147, 
we get 3 or a 2 = a 2 /2. Figure 8-5 shows that even with sample size 2, the sampling distribution of x is 
somewhat like a normal distribution. The larger probabilities are near 12 and they tail off to the right 
and left of 12. 




8.19 List all samples of size « = 3 that are possible (with replacement) from the population 
in Problem 8.17. Use EXCEL to construct the sampling distribution of the mean. Use 
the chart wizard of EXCEL to plot the sampling distribution of the mean. Show = y,, and 
of = y. 

SOLUTION 



A 


B 


C 


D 


E 

xbar 


F 

p(xbar) 


G 

xbar*p(xbar) 


H 

xbar A 2p(xbar) 


9 


9 


9 


9 


9 


0.037037037 


0.333333333 


3 


9 


9 


12 


10 


10 


0.111111111 


1.111111111 


11.11111111 


9 


9 


15 


11 


11 


0.222222222 


2.444444444 


26.88888889 


9 


12 


9 


10 


12 


0.259259259 


3.111111111 


37.33333333 


9 


12 


12 


11 


13 


0.222222222 


2.888888889 


37.55555556 


9 


12 


15 


12 


14 


0.111111111 


1 .555555556 


21.77777778 


9 


15 


9 


11 


15 


0.037037037 


0.555555556 


8.333333333 


9 


15 


12 


12 






12 


146 


9 


15 


15 


13 










12 


9 


9 


10 
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xbar*p(xbar) xbar A 2p(xbar) 



15 15 



15 



15 



The EXCEL worksheet shows the possible sample values in A, B, and C, the mean in D, the sampling 
distribution of xbar is computed and given in E and F. In D2, the function =AVERAGE(A2:C2) is 
entered and a click-and-drag is performed from D2 to D28. Because the population is uniform each 
sample has probability 1/27 of being selected. The sample mean is represented by xbar. The mean of 
the sample mean is = Y,xp(x) and is computed in G2 through G8. The function =SUM(G2:G8) is 
in G9 and is equal to 12 showing that ^ = \i for samples of size n =3. The variance of the sample 
mean is a\ = Hx 2 p{x) - /4 and is computed as follows. £x 2 /?(x) is computed in H2 through H8. The 
function =SUM(H2:H8) is in H9 and is equal to 146. When 12 2 or 144 is subtracted from 146 we 
get 2. Note that a\ = a 2 /3. Figure 8-6 shows the normal distribution tendency of the distribution 
of xbar. 




Fig. 8-6 Distribution of xbar for n = 3. 



CHAP. 8] 



ELEMENTARY SAMPLING THEORY 



223 



8.20 List all 81 samples of size n — 4 that are possible (with replacement) from the population in 
Problem 8.17. Use EXCEL to construct the sampling distribution of the mean. Use the chart 
wizard of EXCEL to plot the sampling distribution of the mean, show that /i ? = /x, and show that 

SOLUTION 

The method used in Problems 8.18 and 8.19 is extended to samples of size 4. From the EXCEL 
worksheet, the following distribution for xbar is obtained. In addition, it can be shown that = \x and 

o* = ^ 2 /4. 



xbar 


p(xbar) 


xbar*p(xbar) 


xbar A 2p(xbar) 


9 


0.012345679 


0.111111111 


1 


9.75 


0.049382716 


0.481481481 


4.694444444 


10.5 


0.12345679 


1 .296296296 


13.61111111 


11.25 


0.197530864 


2.222222222 


25 


12 


0.234567901 


2.814814815 


33.77777778 


12.75 


0.197530864 


2.518518519 


32.11111111 


13.5 


0.12345679 


1 .666666667 


22.5 


14.25 


0.049382716 


0.703703704 


10.02777778 


15 


0.012345679 


0.185185185 


2.777777778 




1 


12 


145.5 



The EXCEL plot of the distribution of xbar for samples of size 4 is shown in Fig. 8-7. 
0.25 -i 1 




-I 1 1 1 1 1 

9 10 11 12 13 14 15 



Fig. 8-7 Distribution of xbar for n = 4. 



Supplementary Problems 



SAMPLING DISTRIBUTION OF MEANS 



8.21 A population consists of the four numbers 3, 7, 11, and 15. Consider all possible samples of size 2 that can be 
drawn with replacement from this population. Find (a) the population mean, (b) the population standard 
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deviation, (c) the mean of the sampling distribution of means, and (d) the standard deviation of the sampling 
distribution of means. Verify parts (c) and (d) directly from (a) and (b) by using suitable formulas. 



8.22 Solve Problem 8.21 if the sampling is without replacement. 

8.23 The masses of 1500 ball bearings are normally distributed, with a mean of 22.40 g and a standard deviation 
of 0.048 g. If 300 random samples of size 36 are drawn from this population, determine the expected mean 
and standard deviation of the sampling distribution of means if the sampling is done (a) with replacement 
and (b) without replacement. 

8.24 Solve Problem 8.23 if the population consists of 72 ball bearings. 

8.25 How many of the random samples in Problem 8.23 would have their means (a) between 22.39 and 22.41 g, 
(b) greater than 22.42 g, (c) less than 22.37 g, and (d) less than 22.38 g or more than 22.41 g? 

8.26 Certain tubes manufactured by a company have a mean lifetime of 800 h and a standard deviation of 60 h. 
Find the probability that a random sample of 16 tubes taken from the group will have a mean lifetime of 
(a) between 790 and 810h, (b) less than 785 h, (c) more than 820 h, and (d) between 770 and 830 h. 

8.27 Work Problem 8.26 if a random sample of 64 tubes is taken. Explain the difference. 

8.28 The weights of packages received by a department store have a mean of 300 pounds (lb) and a standard 
deviation of 501b. What is the probability that 25 packages received at random and loaded on an elevator 
will exceed the specified safety limit of the elevator, listed as 8200 lb? 



RANDOM NUMBERS 

8.29 Work Problem 8.6 by using a different set of random numbers and selecting (a) 15. (b) 30, (c) 45, and (d) 60 
samples of size 4 with replacement. Compare with the theoretical results in each case. 

8.30 Work Problem 8.29 by selecting samples of size (a) 2 and (b) 8 with replacement, instead of size 4 with 
replacement. 

8.31 Work Problem 8.6 if the sampling is without replacement. Compare with the theoretical results. 

8.32 (a) Show how to select 30 samples of size 2 from the distribution in Problem 3.61. 

(b) Compute the mean and standard deviation of the resulting sampling distribution of means, and com- 
pare with theoretical results. 

8.33 Work Problem 8.32 by using samples of size 4. 



SAMPLING DISTRIBUTION OF PROPORTIONS 

8.34 Find the probability that of the next 200 children born, (a) less than 40% will be boys, (b) between 43% and 
57% will be girls, and (c) more than 54% will be boys. Assume equal probabilities for the births of boys and 
girls. 



8.35 Out of 1000 samples of 200 children each, in how many would you expect to find that (a) less than 40% are 
boys, (b) between 40% and 60% are girls, and (c) 53% or more are girls? 
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8.36 Work Problem 8.34 if 100 instead of 200 children are considered, and explain the differences in results. 

8.37 An urn contains 80 marbles, of which 60% are red and 40% are white. Out of 50 samples of 20 marbles, each 
selected with replacement from the urn, how many samples can be expected to consist of (a) equal numbers 
of red and white marbles, (b) 12 red and 8 white marbles, (c) 8 red and 12 white marbles, and (d) 10 or more 
white marbles? 

8.38 Design an experiment intended to illustrate the results of Problem 8.37. Instead of red and white marbles, 
you may use slips of paper on which R and W are written in the correct proportions. What errors might you 
introduce by using two different sets of coins? 

8.39 A manufacturer sends out 1000 lots, each consisting of 100 electric bulbs. If 5% of the bulbs are normally 
defective, in how many of the lots should we expect (a) fewer than 90 good bulbs and (b) 98 or more good 
bulbs? 



SAMPLING DISTRIBUTIONS OF DIFFERENCES AND SUMS 

8.40 A and B manufacture two types of cables that have mean breaking strengths of 4000 lb and 4500 lb and 
standard deviations of 3001b and 2001b, respectively. If 100 cables of brand A and 50 cables of brand B are 
tested, what is the probability that the mean breaking strength of B will be (a) at least 600 lb more than A 
and (b) at least 450 lb more than A1 

8.41 What are the probabilities in Problem 8.40 if 100 cables of both brands are tested? Account for the 
differences. 

8.42 The mean score of students on an aptitude test is 72 points with a standard deviation of 8 points. What is the 
probability that two groups of students, consisting of 28 and 36 students, respectively, will differ in their 
mean scores by (a) 3 or more points, (b) 6 or more points, and (c) between 2 and 5 points? 

8.43 An urn contains 60 red marbles and 40 white marbles. Two sets of 30 marbles each are drawn with 
replacement from the urn and their colors are noted. What is the probability that the two sets differ by 
8 or more red marbles? 

8.44 Solve Problem 8.43 if the sampling is without replacement in obtaining each set. 

8.45 Election returns showed that a certain candidate received 65% of the votes. Find the probability that two 
random samples, each consisting of 200 voters, indicated more than a 10% difference in the proportions who 
voted for the candidate. 

8.46 If U] and U 2 are the sets of numbers in Problem 8.11, verify that (a) Hu\+U2 = Mc/i + an d 
(b) ffm+m = yj°ln+° 2 u2- 

8.47 Three masses are measured as 20.48, 35.97, and 62.34 g, with standard deviations of 0.21, 0.46, and 0.54 g, 
respectively. Find the («) menu and (/>) standard deviation of the sum of the masses. 

8.48 The mean voltage of a battery is 15.0 volts (V) and the standard deviation is 0.2 V. What is the probability 
that four such batteries connected in series will have a combined voltage of 60.8 V or more? 
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8.49 The credit hour distribution at Metropolitan Technological College is as follows: 





6 


9 


12 


15 


18 


p(x) 


0.1 


0.2 


0.4 


0.2 


0.1 



Find fi and a 2 . Give the 25 (with replacement) possible samples of size 2 
probabilities. 

8.50 Refer to problem 8.49. Give and plot the probability distribution of xbar for n 

8.51 Refer to problem 8.50. Show that ^ = yu. and a\ = 

8.52 Refer to problem 8.49. Give and plot the probability distribution of xbar for n 



, their means, and their 
= 2. 

= 3. 



Statistical 
Estimation Theory 



ESTIMATION OF PARAMETERS 

In the Chapter 8 we saw how sampling theory can be employed to obtain information about samples 
drawn at random from a known population. From a practical viewpoint, however, it is often more 
important to be able to infer information about a population from samples drawn from it. Such 
problems are dealt with in statistical inference, which uses principles of sampling theory. 

One important problem of statistical inference is the estimation of population parameters, or briefly 
parameters (such as population mean and variance), from the corresponding sample statistics, or briefly 
statistics (such as sample mean and variance). We consider this problem in this chapter. 



UNBIASED ESTIMATES 

If the mean of the sampling distribution of a statistic equals the corresponding population 
parameter, the statistic is called an unbiased estimator of the parameter; otherwise, it is called a 
biased estimator. The corresponding values of such statistics are called unbiased or biased estimates, 
respectively. 



EXAMPLE 1. The mean of the sampling distribution of means \ix is /i, the population mean. Hence the sample 
mean X is an unbiased estimate of the population mean fi. 

EXAMPLE 2. The mean of the sampling distribution of variances is 

a = - - 1 a 2 
M N 

where a 2 is the population variance and N is the sample size (see Table 8.1). Thus the sample variance s 2 is a biased 
estimate of the population variance a 2 . By using the modified variance 

„ 2 _ N 2 

S ~ N-l S 

we find up = a 2 , so that .? 2 is an unbiased estimate of a 2 . However, j is a biased estimate of a. 
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In the language of expectation (see Chapter 6) we could say that a statistic is unbiased if its 
expectation equals the corresponding population parameter. Thus X and .v 2 are unbiased since 
E{X} = fi and E{s 2 } = a 1 . 



EFFICIENT ESTIMATES 

If the sampling distributions of two statistics have the same mean (or expectation), then the statistic 
with the smaller variance is called an efficient estimator of the mean, while the other statistic is called an 
inefficient estimator. The corresponding values of the statistics are called efficient and unefficient estimates. 

If we consider all possible statistics whose sampling distributions have the same mean, the one with 
the smallest variance is sometimes called the most efficient, or best, estimator of this mean. 

EXAMPLE 3. The sampling distributions of the mean and median both have the same mean, namely, the popula- 
tion mean. However, the variance of the sampling distribution of means is smaller than the variance of the sampling 
distribution of medians (see Table 8.1). Hence the sample mean gives an efficient estimate of the population mean, 
while the sample median gives an inefficient estimate of it. 

Of all statistics estimating the population mean, the sample mean provides the best (or most efficient) estimate. 

In practice, inefficient estimates are often used because of the relative ease with which some of them 
can be obtained. 



POINT ESTIMATES AND INTERVAL ESTIMATES; THEIR RELIABILITY 

An estimate of a population parameter given by a single number is called a point estimate of the 
parameter. An estimate of a population parameter given by two numbers between which the parameter 
may be considered to lie is called an interval estimate of the parameter. 

Interval estimates indicate the precision, or accuracy, of an estimate and are therefore preferable to 
point estimates. 

EXAMPLE 4. If we say that a distance is measured as 5.28 meters (m), we are giving a point estimate. If, on the 
other hand, we say that the distance is 5.28 ± 0.03 m (i.e., the distance lies between 5.25 and 5.31 m), we are giving an 
interval estimate. 

A statement of the error (or precision) of an estimate is often called its reliability. 



CONFIDENCE-INTERVAL ESTIMATES OF POPULATION PARAMETERS 

Let fi s and a s be the mean and standard deviation (standard error), respectively, of the sampling 
distribution of a statistic 5. Then if the sampling distribution of 5 is approximately normal (which as we 
have seen is true for many statistics if the sample size TV > 30), we can expect to find an actual sample 
statistic S lying in the intervals [i s — a s to [i s + a s , [i s — 2a s to + 2a s , or [i s — 3a s to /i S + 3a s about 
68.27%, 95.45%, and 99.73% of the time, respectively. 

Equivalently, we can expect to find (or we can be confident of finding) /j, s in the intervals S — a s to 
S + a s , S - 2cr s to S + 2a s , or S - 3a s to S + 3a s about 68.27%, 95.45%, and 99.73% of the time, 
respectively. Because of this, we call these respective intervals the 68.27%, 95.45%, and 99.73% con- 
fidence intervals for estimating /i s . The end numbers of these intervals (S ± a s , S ± 20-5, and 5 ± 30-5) are 
then called the 68.27%, 95.45%, and 99.73% confidence limits, or fiducial limits. 

Similarly, S± 1.96a s and S±2.58cr s are the 95% and 99% (or 0.95 and 0.99) confidence limits 
for S. The percentage confidence is often called the confidence level. The numbers 1.96, 2.58, etc., in 
the confidence limits are called confidence coefficients, or critical rallies, and are denoted by z c . From 
confidence levels we can find confidence coefficients, and vice versa. 
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Table 9.1 shows the values of z c corresponding to various confidence levels used in practice. For 
confidence levels not presented in the table, the values of z c can be found from the normal-curve area 
tables (see Appendix II). 



Confidence level 



Confidence Intervals for Means 

If the statistic 5 is the sample mean X, then the 95% and 99% confidence limits for estimating the 
population mean \x, are given by X± 1.96ojp an d X±2.58ax, respectively. More generally, the con- 
fidence limits are given by X ±z c a^, where z c (which depends on the particular level of confidence 
desired) can be read from Table 9.1. Using the values of obtained in Chapter 8, we see that the 
confidence limits for the population mean are given by 



X ± z c 



VN 



(J) 



if the sampling is either from an infinite population or with replacement from a finite population, and are 
given by 



(2) 



if the sampling is without replacement from a population of finite size N p . 

Generally, the population standard deviation a is unknown; thus, to obtain the above confidence 
limits, we use the sample estimate s or s. This will prove satisfactory when TV > 30. For N < 30, the 
approximation is poor and small sampling theory must be employed (see Chapter 11). 



Confidence Intervals for Proportions 

If the statistic S is the proportion of "successes" in a sample of size TV drawn from a binomial 
population in which p is the proportion of successes (i.e., the probability of success), then the confidence 
limits for p are given by P ± z c a P , where P is the proportion of successes in the sample of size N. Using 
the values of o> obtained in Chapter 8, we see that the confidence limits for the population proportion 
are given by 

if the sampling is either from an infinite population or with replacement from a finite population and are 
given by 




(4) 



if the sampling is without replacement from a population of finite size N p . 

To compute these confidence limits, we can use the sample estimate P for p, which will generally 
prove satisfactory if TV > 30. A more exact method for obtaining these confidence limits is given in 
Problem 9.12. 
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Confidence Intervals for Differences and Sums 

If Si and S 2 are two sample statistics with approximately normal sampling distributions, confidence 
limits for the difference of the population parameters corresponding to Si and 5*2 are given by 

Si-S 2 ± z c a Si _ Si = Si-S 2 ± z c ^a 2 Si +<j 2 Si (5) 

while confidence limits for the sum of the population parameters are given by 

Si+S 2 ±z c a Sl+S2 = Si + S 2 ± z c yj^+a 2 ^ (6) 

provided that the samples are independent (see Chapter 8). 

For example, confidence limits for the difference of two population means, in the case where the 
populations are infinite, are given by 

Xi - X 2 ± z e a 2l _f 2 =Xi-X 2 ± z c (7) 

where X u a x , Ni and X 2 , a 2 , N 2 are the respective means, standard deviations, and sizes of the two 
samples drawn from the populations. 

Similarly, confidence limits for the difference of two population proportions, where the populations 
are infinite, are given by 

Pi-P 2 ± z c a Pl _ P2 =Pi-P 2 ± z c f^^^ (8) 

where Pi and P 2 are the two sample proportions, N { and N 2 are the sizes of the two samples drawn from 
the populations, and p t and p 2 are the proportions in the two populations (estimated by Pi and P 2 ). 

Confidence Intervals for Standard Deviations 

The confidence limits for the standard deviation a of a normally distributed population, as estimated 
from a sample with standard deviation s, are given by 

.v ± z c a. = s±z c — !L= (9) 
V2N 

using Table 8.1. In computing these confidence limits, we use s or s to estimate a. 



PROBABLE ERROR 

The 50% confidence limits of the population parameters corresponding to a statistic S are given by 
5 ± 0.6745(75. The quantity 0.6745(7,5 is known as the probable error of the estimate. 
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Solved Problems 



UNBIASED AND EFFICIENT ESTIMATES 

9.1 Give an example of estimators (or estimates) that are (a) unbiased and efficient, (b) unbiased and 
inefficient, and (c) biased and inefficient. 

SOLUTION 

(a) The sample mean X and the modified sample variance 

S N-l" 

are two such examples. 

(b) The sample median and the sample statistic \ {Q\ + Q 3 ), where gi an d Qi are the lower and upper 
sample quartiles, are two such examples. Both statistics are unbiased estimates of the population mean, 
since the mean of their sampling distributions is the population mean. 

(c) The sample standard deviation s, the modified standard deviation s, the mean deviation, and the 
semi-interquartile range are four such examples. 

9.2 In a sample of five measurements, the diameter of a sphere was recorded by a scientist as 6.33, 
6.37, 6.36, 6.32, and 6.37 centimeters (cm). Determine unbiased and efficient estimates of 
(a) the true mean and (b) the true variance. 

SOLUTION 

(a) The unbiased and efficient estimate of the true mean (i.e., the populations mean) is 

- _ |]Z _ 6.33 + 6.37 + 6.36 + 6.32 + 6.37 _ 6 35 cm 
~~ N ~ 5 ~~ 

(b) The unbiased and efficient estimate of the true variance (i.e., the population variance) is 

?2 = i = H^-X) 2 

N- 1 N — \ 

_ (6.33 - 6.35) 2 + (6.37 - 6.35) 2 + (6.36 - 6.35) 2 + (6.32 - 6.35) 2 + (6.37 - 6.35) 2 
5-1 

= 0.00055 cm 2 

Note that although s = V0. 00055 = 0.023 cm is an estimate of the true standard deviation, this 
estimate is neither unbiased nor efficient. 

9.3 Suppose that the heights of 100 male students at XYZ University represent a random sample of 
the heights of all 1546 students at the university. Determine unbiased and efficient estimates of 
(a) the true mean and (b) the true variance. 

SOLUTION 

(a) From Problem 3.22, the unbiased and efficient estimate of the true mean height is X = 67.45 inches (in). 

(b) From Problem 4.17, the unbiased and efficient estimate of (lie true v. 



Thus s = V8.6136 = 2.93 in. Note that since N is large, there is essentially no difference between s and 
.v 2 or between s and i. 
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Note that we have not used Sheppard's correction for grouping. To take this into account, 
we would use s = 2.79 in (see Problem 4.21). 



9.4 Give an unbiased and inefficient estimate of the true mean diameter of the sphere of Problem 9.2. 



The median is one example of an unbiased and inefficient estimate of the population mean. For the five 
measurements arranged in order of magnitude, the median is 6.36 cm. 



CONFIDENCE INTERVALS FOR MEANS 

9.5 Find the (a) 95% and (b) 99% confidence intervals for estimating the mean height of the XYZ 
University students in Problem 9.3. 



(a) The 95% confidence limits are X ± \ .96a/y/N. Using X = 67.45 in, and s = 2.93 in as an estimate of 
a (see Problem 9.3), the confidence limits are 67.45 ± 1.96(2.93/-/l00), or 67.45 ± 0.57 in. Thus the 
95% confidence interval for the population mean \x is 66.88 to 68.02 in, which can be denoted 
by 66.88 < fj, < 68.02. 

We can therefore say that the probability that the population mean height lies between 66.88 and 
68.02 in is about 95%, or 0.95. In symbols we write Pr{66.88 < /i < 68.02} = 0.95. This is equivalent to 
saying that we are 95% confident that the population mean (or true mean) lies between 66.88 and 
68.02 in. 



(b) The 99% confidence limits are X ± 2.58a/ Vn = X ± 2.58.5/%/^ = 67.45 ± 2.58(2.93/ v / l00) = 
67.45 ±0.76in. Thus the 99% confidence interval for the population mean /i is 66.69 to 68.21 in, 
which can be denoted by 66.69 < \i < 68.21. 

In obtaining the above confidence intervals, we assumed that the population was infinite or so large 
that we could consider conditions to be the same as sampling with replacement. For finite populations 
where sampling is without replacement, we should use 



to be essentially 1 .0, and thus it need not be used. If it is used, the above confidence limits become 
67.45 ± 0.56 in and 67.45 ± 0.73 in, respectively. 



9.6 Blaises' Christmas Tree Farm has 5000 trees that are mature and ready to be cut and sold. 
One-hundred of the trees are randomly selected and their heights measured. The heights in inches 
are given in Table 9.2. Use MINITAB to set a 95% confidence interval on the mean height of all 
5000 trees. If the trees sell for $2.40 per foot, give a lower and an upper bound on the value of the 
5000 trees. 



SOLUTION 



SOLUTION 




However, we can consider the factor 




SOLUTION 



The MINITAB confidence interval given below indicates that the mean height for the 5000 trees could 
be as small as 57.24 or as large as 61.20 inches. The total number of inches for all 5000 trees ranges between 
(57.24)(5000) = 286,2000 and (61.20)(5000) = 306,000. If the trees sell for $2.40 per foot, then cost 
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per inch is $0.2. The value of the trees ranges between (286,000) (0.2) = $57,200 and (306,000) (0.2) = 
$61,200 with 95% confidence. 



67 59 53 55 



47 47 55 73 



71 69 69 44 
57 75 55 66 



62 52 50 63 



MTB > standard deviation 
Column Standard Deviation 

Standard deviation o: 
MTB > zinterval 95 pe: 

Confidence Intervals 

The assumed sigma = 11 
Variable N Me 



light = 10. Ill 

it confidence sd = 10. Ill dal 



100 



StDev 
10. 11 



SE Mean 95 .0 % CI 

1.01 (57.24, 61.20) 



9.7 In a survey of Catholic priests each priest reported the total number of baptisms, marriages, and 
funerals conducted during the past calendar year. The responses are given in Table 9.3. Use this 
data to construct a 95% confidence interval on y, the mean number of baptisms, marriages, 



STATISTICAL ESTIMATION THEORY 



and funerals conducted during the past calendar year per priest for all priests. Construct the 
interval by use of the confidence interval formula, and also use the Z interval command of 
MINITAB to find the interval. 

SOLUTION 

After entering the data from Table 9.3 into column 1 of MINITAB'S worksheet, and naming the 
column 'number', the mean and standard deviation commands were given. 

MTB > mean cl 

Column Mean 

Mean of Number = 40.261 
MTB > standard deviation cl 

Column Standard Deviation 

Standard deviation of Number = 9 . 9895 

The standard error of the mean is equal to 9.9895/\/50 = 1.413, the critical value is 1.96, and the 95% 
margin of error is 1.96(1.413) = 2.769. The confidence interval extends from 40.261 - 2.769 = 37.492 to 
40.261 +2.769 = 43.030. 

The Z interval command produces the following output. 

MTB > Zinterval 95% confidence sd = 9.9895 data in cl 
Z Confidence Intervals 

The assumed sigma = 9.99 

Variable N Mean StDev SE Mean 95.00 % CI 
Number 50 40.26 9.99 1.41 (37.49,43.03) 

We are 95% confident that the true mean for all priests is between 37.49 and 43.03. 



9.8 In measuring reaction time, a psychologist estimates that the standard deviation is 0.05 seconds 
(s). How large a sample of measurements must he take in order to be (a) 95% and (b) 99% 
confident that the error of his estimate will not exceed 0.01 s? 

SOLUTION 

(a) The 95% confidence limits are X ± 1.96cr/\/jV, the error of the estimate being \.96a/-/N. Taking 
a = .s = 0.05s, we see that this error will be equal to 0.01s if (1.96)(0.05)/v^V = 0.01; that is, 
V / iV= (1.96)(0.05)/0.01 = 9.8, or N = 96.04. Thus we can be 95% confident that the error of the 
estimate will be less than 0.01 s if N is 97 or larger. 
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Another method 



( L96 )( - 05 ) <0 01 if jff >_L or ^ > (l-96)(0.05) = 

(1.96)(0.05) - 0.01 V - 0.01 

Then N > 96.04, or N > 97. 
(b) The 99% confidence limits are X ± 2.58a/ x/iV. Then (2.58)(0.05)/^ = 0.01, or N = 166.4. Thus we 
can be 99% confident that the error of the estimate will be less than 0.01 s only if N is 167 or larger. 



9.9 A random sample of 50 mathematics grades out of a total of 200 showed a mean of 75 and a 
standard deviation of 10. 

(a) What are the 95% confidence limits for estimates of the mean of the 200 grades? 

(b) With what degree of confidence could we say that the mean of all 200 grades is 75 ± 1? 
SOLUTION 

(a) Since the population size is not very large compared with the sample size, we must adjust for it. Then 
the 95% confidence limits are 

X± \.96a x =X± 1.96 4= 

(b) The confidence limits can be represented by 



X ± z, a x = X ± z, 



Since this must equal 75 ± 1, we have 1.23z f = 1, or z c = 0.81. The area under the normal curve from 
z = to z = 0.81 is 0.2910; hence the required degree of confidence is 2(0.2910) = 0.582, or 58.2%. 



CONFIDENCE INTERVALS FOR PROPORTIONS 

9.10 A sample poll of 100 voters chosen at random from all voters in a given district indicated that 
55% of them were in favor of a particular candidate. Find the (a) 95%, (b) 99%, and (c) 99.73% 
confidence limits for the proportion of all the voters in favor of this candidate. 

SOLUTION 

(a) The 95% c onfidence limits for the population p are P± 1.96o> = P± 1.96^(1 —p)/N = 
0.55 ± 1.96 v'(0.55)(0.45)/100 = 0.55 ± 0.10, where we have used the sample proportion P to 
estimate p. 

(b) The 99% confidence limits for p are 0.55 ± 2.58 ,/(0.55)(0.45)/100 = 0.55 ± 0.13. 

(c) The 99.73% confidence limits for p are 0.55 ± 3 v /(0.55)(0.45)/100 = 0.55 ± 0.15. 



9.11 How large a sample of voters should we take in Problem 9.10 in order to be (a) 95% and 
(b) 99.73% confident that the candidate will be elected? 

SOLUTION 

The confidence limits for p are P ± z c y>(l -p)/N = 0.55 ± z c v /(0.55)(0.45)/iV = 0.55 ± 0.50z c /^/N, 
where we have used the estimate P = p = 0.55 on the basis of Problem 9.10. Since the candidate will 
win only if she receives more than 50% of the population's votes, we require that Q.5QzJ\/N be less 
than 0.05. 
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(a) For 95% confidence, 0.50z c /VN = 0.50(1. 96)/%/^ = 0.05 when N= 384.2. Thus N should be 
at least 385. 

(b) For 99.73% confidence, O.SOzJ^N = 0.50(3) /y/N = 0.05 when N = 900. Thus N should be at least 901 . 
Another method 

1.50/V^V < 0.05 when x/iV/1.50 > 1/0.05 or Vn > 1.50/0.05. Then \fN > 30 or TV > 900, so that 
N should be at least 901. 



9.12 A survey is conducted and it is found that 156 out of 500 adult males are smokers. Use the 
software package STATISTIX to set a 99% confidence interval on p, the proportion in the 
population of adult males who are smokers. Check the confidence interval by computing it 
by hand. 



SOLUTION 

The STATISTIX output is a 



follows. The 99% confidence interval is shown in bold. 



One-Sample Proporti< 



Sample Si 
Successe: 
Proportic 



500 
156 
.31200 



Null Hypothesis : P = 0.5 
Alternative Hyp : P < > . 5 



Difference 
Standard Error 
icted) 
iCted) 



0.02072 
-8.41 
-8.36 



. 0000 
0.0000 



Uncorrected 

Corrected 



99% Confidence Interval 
(0.25863, 0.36537) 

(0.25763, 0.36637) 



We are 99% confident that the true percent of adult male smokers is between 25.9% and 36.5%. 



P = 0.312, z c = 2.58, 



P±z c y N or °- 312 ± 2-58(0.0207) or (0.258, 0.365) This is the s; 
e package STATISTIX. 



s given above by the soft- 



9.13 Refer to Problem 9.12. Set a 99% confidence interval on p using the software package MINITAB. 
SOLUTION 

The 99% confidence interval is shown in bold below. It is the same as the STATISTIX confidence 
interval shown in Problem 9.12. 



Sample X N Sample P 99% CI z-Value P-Value 

1 156 500 0.312000 (0.258629, 0.365371) -8.41 0.000 
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CONFIDENCE INTERVALS FOR DIFFERENCES AND SUMS 



9.14 A study was undertaken to compare the mean time spent on cell phones by male and female 
college students per week. Fifty male and 50 female students were selected from Midwestern 
University and the number of hours per week spent talking on their cell phones determined. 
The results in hours are shown in Table 9.4. Set a 95% confidence interval on - fi 2 using 
MINITAB. Check the results by calculating the interval by hand. 

Table 9.4 



SOLUTION 

When both samples exceed 30, the two sample t test and the z test may be used interchangeably since the 
t distribution and the z distribution are very similar. 

Two-sample T for males vs females 

N Mean StDev SE Mean 

males 50 9.82 2.15 0.30 

females 50 9.70 1.78 0.25 

Difference = mu (males) - mu (females) 
Estimate for difference : 0.120000 
95% CI for difference: (-0.663474, 0.903474) 

T-Test of difference = (vs not =) : T-Value = 0.30 P-Value = 0.762 
DF = 98 

Both use Pooled StDev = 1 .9740 

According to the MINITAB output, the difference in the population means is between -0.66 and 0.90. 
There is a good chance that there is no difference in the population means. 

Check: 



The formula for a 95% confidence interval is (x t - x 2 ) ± zA J(s\/n\ ) + (A/n 2 ) J • Substituting, we 
obtain 0. 12 ± 1.96(0.395) and get the answer given by MINITAB. 



9.15 Use STAT1STIX and SPSS to solve Problem 9.14. 
SOLUTION 



The STATISTIX solution is given below. Note that the 95% confidence interval is the same as that in 
Problem 9.14. We shall comment later on why we assume equal variances. 



STATISTICAL ESTIMATION THEORY 



Two-Sample T Tests for males vs females 
Variable Mean N SD SE 

males 9.8200 50 2.1542 0.3046 

females 9.7000 50 1.7757 0.2511 

Difference . 1200 



Null Hypothesis: difference = 
Alternative Hyp : dif ference < 

Assumption T 
Equal Variances 0.30 9 

Unequal Variances 0.3 9 



0.7618 

0.7618 



95% CI for Difference 

Lower Upper 
-0.6635 0.9035 

-0.6638 0.9038 



Test for Equality F 

of Variances 1.47 
The SPSS solution is as follows: 



Group Statistics 



Std. Deviation 



Std. Error 
Mean 



1.00 
2.00 



9.7000 
9.8200 



1 .77569 
2.15416 



.25112 
.30464 



Independent Samples Test 





Levene's Test for 
Equality of 
Variances 


t-test for Equality of Means 












Sig. 




Std. Error 


95% Confidence 
Interval of the 
Difference 




F 


Sig. 




df 


(2-tailed) 


Difference 


Difference 




Upper 


time Equal variances 
assumed 




.346 


-.304 




.762 


-.12000 


.39480 


-.90347 


.66347 


not assumed 






-.304 


94.556 


.762 


-.12000 


.39480 


-.90383 


.66383 



9.16 Use SAS to work Problem 9.14. Give the forms of the data files that SAS allows for the analysis. 



The SAS analysis is as follows. The confidence interval is shown in bold at the bottom of the 
following output. 
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Two Sample t-test for the Means of males and females 
Sample Statistics 
Group N Mean Std. Dev. Std. Error 



males 50 9.82 2.1542 0.3046 

females 50 9.7 1.7757 0.2511 

Null hypothesis: Mean 1- Mean 2 = 
Alternative: Mean 1- Mean 2 

If Variances Are t statistic Df Pr > t 



Equal 0.304 98 0.7618 

Not Equal 0.304 94.56 0.7618 

95% Confidence Interval for the Difference between Two Means 

Lower Limit Upper Limit 



-0.66 0.90 

The data file used in the SAS analysis may have the data for males and females in separate columns or the 
data may consist of hours spent on the cell phone in one column and the sex of the person (Male or Female) 
in the other column. Male and female may be coded as 1 or 2. The first form consists of 2 columns and 
50 rows. The second form consists of 2 columns and 100 rows. 



CONFIDENCE INTERVALS FOR STANDARD DEVIATIONS 

9.17 A confidence interval for the variance of a population utilizes the chi-square distribution. The 
(1 - a) x 100% confidence interval is ( "~'^ ) < a 2 < jp-^y, where n is the sample size, S 2 is the 
sample variance, Xo/2 ar) d Xa-o/2 come from the chi-square distribution with (n - 1) degrees of 
freedom. Use EXCEL software to find a 99% confidence interval for the variance of twenty 180 
ounce containers. The data from twenty containers is shown in Table 9.5. 



Table 9.5 



181.5 


180.8 


179.7 


182.4 


178.7 


178.5 


183.9 


182.2 


179.7 


180.9 


180.6 


181.4 


180.4 


181.4 


178.5 


180.6 


178.8 


180.1 


181.3 


182.2 



SOLUTION 

The EXCEL worksheet is as follows. The data are in A1:B10. Column D shows the function in column 
C which gives the values shown. 
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A 


B 


C 


D 


181.5 


180.8 


2.154211 


=VAR(A1:B10) 


179.1 


182.4 


40.93 


=19*C1 


178.7 


178.5 


38.58226 


=CHIINV(0.005,19) 


183.9 


182.2 


6.843971 


=CHIINV(0.995,19) 


179.7 


180.9 






180.6 


181.4 


1 .06085 


=C2/C3 


180.4 


181.4 


5.980446 


=C2/C4 


178.5 


180.6 






178.8 


180.1 






181.3 


182.2 







The following is a 99% confidence interval for a 1 : (1.06085 < a 2 < 5.980446).The following is a 
99% confidence interval for a : (1.03, 2.45). 

Note that =VAR(A1:B10) gives S 2 , =CHIINV(0.005,19) is the chi-square value with area 0.005 to its 
right, and =CHIINV(0.995,19) is the chi-square value with area 0.995 to its right. In all cases, the chi-square 
distribution has 19 degrees of freedom. 



9.18 When comparing the variance of one population with the variance of another population, the 
following (1 - a) x 100% confidence interval may be used: 



where n x and n 2 are the two sample sizes, S x and S 2 are the two sample variances. 
v { = n x - 1 and v 2 = n 2 - 1 are the numerator and denominator degrees of freedom for the F distribution 
and the F values are from the F distribution. Table 9.6 gives the number of e-mails sent per week by 
employees at two different companies. 

Set a 95% confidence interval for — . 



Table 9.6 



Company 1 


Company 2 


81 


99 


104 


100 


115 


104 


111 

85 


98 
103 


121 


113 


95 


95 


112 


107 


100 


98 


117 


95 


113 


101 


109 


109 


101 


99 




93 




105 
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SOLUTION 

The EXCEL worksheet is shown below. Column D shows the function in column C which gives the 
values shown. The two sample variances are computed in CI and C2. The F values are computed in C3 and 
C4. The end points of the confidence interval for the ratio of the variances are computed in C5 and C6. We 

see that a 95% confidence interval for ^ is (1.568, 15.334). A 95% confidence interval for — is (1.252, 3.916). 
Note that =FINV(0. 025, 12,14) is the point associated with the F distribution with v x = \2 and v 2 =14 
degrees of freedom so that 0.025 of the area is to the right of that point. 



A 


B 


C 


D 


Company 1 


Company 2 


148.5769231 


=VAR(A2:A14) 


81 


99 


31 .06666667 


=VAR(B2:B16) 


104 


100 


3.050154789 


=FINV(0.025,12,14) 


115 


104 


3.2062117 


=FINV(0.025,14,12) 


111 


98 


1.567959436 


=(C1/C2)/C3 


85 


103 


15.33376832 


=(C1/C2)*C4 


121 


113 






95 


95 


1.25218187 


=SQRT(C5) 


112 


107 


3.915835584 


=SQRT(C6) 


100 


98 






117 


95 






113 


101 






109 


109 






101 


99 
93 
105 







PROBABLE ERROR 



9.19 The voltages of 50 batteries of the same type have a mean of 18.2 volts (V) and a standard 
deviation of 0.5 V. Find (a) the probable error of the mean and (b) the 50% confidence limits. 

SOLUTION 

(a) 



Note that if the standard deviation of 0.5 V is computed as .5, the probable error is 
0. 6745(0. 5/v^50) = 0.048 also, so that either estimate can be used if N is large enough. 



(b) The 50% confidence limits a 



9.20 A measurement was recorded as 216.480 grams (g) with a probable error of 0.272 g. What are the 
95% confidence limits for the measurement? 
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SOLUTION 

The probable error is 0.272 = 0.6745o>, or a x = 0.272/0.6745. Thus the 95% confidence limits are 
X ± 1.96^ = 216.480 ± 1.96(0.272/0.6745) = 216.480 ± 0.790 g. 



Supplementary Problems 

UNBIASED AND EFFICIENT ESTIMATES 

9.21 Measurements of a sample of masses were determined to be 8.3, 10.6, 9.7, 8.8, 10.2, and 9.4 kilograms (kg), 
respectively. Determine unbiased and efficient estimates of (a) the population mean and (b) the 
population variance, and (c) compare the sample standard deviation with the estimated population standard 
deviation. 

9.22 A sample of 10 television tubes produced by a company showed a mean lifetime of 1200 hours (h) and a 
standard deviation of 100 h. Estimate (a) the mean and (b) the standard deviation of the population of all 
television tubes produced by this company. 

9.23 (a) Work Problem 9.22 if the same results are obtained for 30, 50, and 100 television tubes. 

(b) What can you conclude about the relation between sample standard deviations and estimates of 
population standard deviations for different sample sizes? 

CONFIDENCE INTERVALS FOR MEANS 



9.24 The mean and standard deviation of the maximum loads supported by 60 cables (see Problem 3.59) are given 
by 11.09 tons and 0.73 ton, respectively. Find the (a) 95% and (b) 99% confidence limits for the mean of the 
maximum loads of all cables produced by the company. 

9.25 The mean and standard deviation of the diameters of a sample of 250 rivet heads manufactured by a 
company are 0.72642 in and 0.00058 in, respectively (see Problem 3.61). Find the (a) 99%, (b) 98%, 
(c) 95%, and (d) 90% confidence limits for the mean diameter of all the rivet heads manufactured by 
the company. 

9.26 Find (a) the 50% confidence limits and (b) the probable error for the mean diameters in Problem 9.25. 

9.27 If the standard deviation of the lifetimes of television tubes is estimated to be lOOh, how large a sample 
must we take in order to be (a) 95%, (b) 90%, (c) 99%, and (d) 99.73% confident that the error in the 
estimated mean lifetime will not exceed 20 h? 

9.28 A group of 50 internet shoppers were asked how much they spent per year on the Internet. Their responses 
are shown in Table 9.7. 

Find an 80% confidence interval for n, the mean amount spent by all internet shoppers, using the 
equations found in Chapter 9 as well as statistical software. 
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9.29 A company has 500 cables. A test of 40 cables selected at random showed a mean breaking strength of 2400 
pounds (lb) and a standard deviation of 1501b. 

(a) What are the 95% and 99% confidence limits for estimating the mean breaking strength of the 
remaining 460 cables? 

(b) With what degree of confidence could we say that the mean breaking strength of the remaining 
460 cables is 2400 ±35 lb? 



CONFIDENCE INTERVALS FOR PROPORTIONS 



9.30 An urn contains an unknown proportion of red and white marbles. A random sample of 60 marbles selected 
with replacement from the urn showed that 70% were red. Find the (a) 95%, (b) 99%, and (c) 99.73% 
confidence limits for the actual proportion of red marbles in the urn. 



9.31 A poll of 1000 individuals over the age of 65 years was taken to determine the percent of the population in 
this age group who had an Internet connection. It was found that 387 of the 1000 had an internet connection. 
Using the equations in the book as well as statistical software, find a 97.5% confidence interval for p. 

9.32 It is believed that an election will result in a very close vote between two candidates. What is the least number 
of voters that one should poll in order to be (a) 80%, (b) 90%, (c) 95%, and (d) 99% confident of a decision 
in favor of either one of the candidates? 



CONFIDENCE INTERVALS FOR DIFFERENCES AND SUMS 



9.33 Of two similar groups of patients, A and B, consisting of 50 and 100 individuals, respectively, the first was 
given a new type of sleeping pill and the second was given a conventional type. For the patients in group A, 
the mean number of hours of sleep was 7.82 with a standard deviation of 0.24 h. For the patients in group B, 
the mean number of hours of sleep was 6.75 with a standard deviation of 0.30 h. Find the (a) 95% and 
(b) 99% confidence limits for the difference in the mean number of hours of sleep induced by the two types 
of sleeping pills. 



9.34 A study was conducted to compare the mean lifetimes of males with the mean lifetimes of females. Random 
samples were collected from the obituary pages and the data in Table 9.8 were collected. 
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Using the above data, the equations in the book, and statistical software, set an 85% confidence interval 

,n UMALE _ ^FEMALE- 



9.35 Two areas of the country are compared as to the percent of teenagers who have at least one cavity. One area 
fluoridates its water and the other doesn't. The sample from the non-fluoridated area finds that 425 out of 
1000 have at least one cavity. The sample from the fluoridated area finds that 376 out of 1000 have at least 
one cavity. Set a 99% confidence interval on the difference in percents using the equations in the book as well 
as statistical software. 



CONFIDENCE INTERVALS FOR STANDARD DEVIATIONS 



9.36 The standard deviation of the breaking strengths of 100 cables tested by a company was 1801b. Find the 
(a) 95%, (b) 99%, and (c) 99.73% confidence limits for the standard deviation of all cables produced by the 
company. 



9.37 Solve Problem 9.17 using S AS. 



9.38 Solve Problem 9.18 using SAS. 




Statistical 
Decision Theory 




STATISTICAL DECISIONS 

Very often in practice we are called upon to make decisions about populations on the basis of sample 
information. Such decisions are called statistical decisions. For example, we may wish to decide on the 
basis of sample data whether a new serum is really effective in curing a disease, whether one educational 
procedure is better than another, or whether a given coin is loaded. 



STATISTICAL HYPOTHESES 

In attempting to reach decisions, it is useful to make assumptions (or guesses) about the populations 
involved. Such assumptions, which may or may not be true, are called statistical hypotheses. They are 
generally statements about the probability distributions of the populations. 



Null Hypotheses 

In many instances we formulate a statistical hypothesis for the sole purpose of rejecting or 
nullifying it. For example, if we want to decide whether a given coin is loaded, we formulate the 
hypothesis that the coin is fair (i.e., p = 0.5, where p is the probability of heads). Similarly, if we 
want to decide whether one procedure is better than another, we formulate the hypothesis that there 
is no difference between the procedures (i.e., any observed differences are due merely to fluctuations 
in sampling from the same population). Such hypotheses are often called null hypotheses and are 
denoted by H . 



Alternative Hypotheses 

Any hypothesis that differs from a given hypothesis is called an alternative hypothesis. For example, 
if one hypothesis is = 0.5, alternative hypotheses might be p = 0.7, p ^ 0.5, or p > 0.5. A hypothesis 
alternative to the null hypothesis is denoted by H { . 
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TESTS OF HYPOTHESES AND SIGNIFICANCE, OR DECISION RULES 

If we suppose that a particular hypothesis is true but find that the results observed in a random 
sample differ markedly from the results expected under the hypothesis (i.e., expected on the basis of pure 
chance, using sampling theory), then we would say that the observed differences are significant and would 
thus be inclined to reject the hypothesis (or at least not accept it on the basis of the evidence obtained). 
For example, if 20 tosses of a coin yield 16 heads, we would be inclined to reject the hypothesis that the 
coin is fair, although it is conceivable that we might be wrong. 

Procedures that enable us to determine whether observed samples differ significantly from the results 
expected, and thus help us decide whether to accept or reject hypotheses, are called tests of hypotheses, 
tests of significance, rules of decision, or simply decision rules. 



TYPE I AND TYPE II ERRORS 

If we reject a hypothesis when it should be accepted, we say that a Type I error has been made. If, on 
the other hand, we accept a hypothesis when it should be rejected, we say that a Type II error has been 
made. In either case, a wrong decision or error in judgment has occurred. 

In order for decision rules (or tests of hypotheses) to be good, they must be designed so as to 
minimize errors of decision. This is not a simple matter, because for any given sample size, an attempt 
to decrease one type of error is generally accompanied by an increase in the other type of error. In 
practice, one type of error may be more serious than the other, and so a compromise should be reached 
in favor of limiting the more serious error. The only way to reduce both types of error is to increase the 
sample size, which may or may not be possible. 



LEVEL OF SIGNIFICANCE 

In testing a given hypothesis, the maximum probability with which we would be willing to risk a 
Type I error is called the level of significance, or significance level, of the test. This probability, often 
denoted by a, is generally specified before any samples are drawn so that the results obtained will not 
influence our choice. 

In practice, a significance level of 0.05 or 0.01 is customary, although other values are used. If, for 
example, the 0.05 (or 5%) significance level is chosen in designing a decision rule, then there are about 
5 chances in 100 that we would reject the hypothesis when it should be accepted; that is, we are about 95% 
confident that we have made the right decision. In such case we say that the hypothesis has been rejected 
at the 0.05 significance level, which means that the hypothesis has a 0.05 probability of being wrong. 



TESTS INVOLVING NORMAL DISTRIBUTIONS 

To illustrate the ideas presented above, suppose that under a given hypothesis the sampling dis- 
tribution of a statistic S is a normal distribution with mean [i s and standard deviation <r s . Thus the 
distribution of the standardized variable (or z score), given by z = (S - ^ s )/a s , is the standardized 
normal distribution (mean 0, variance 1), as shown in Fig. 10-1. 

As indicated in Fig. 10-1, we can be 95% confident that if the hypothesis is true, then the z score of 
an actual sample statistic 5 will lie between —1.96 and 1.96 (since the area under the normal curve 
between these values is 0.95). However, if on choosing a single sample at random we find that the z score 
of its statistic lies outside the range —1.96 to 1.96, we would conclude that such an event could happen 
with a probability of only 0.05 (the total shaded area in the figure) if the given hypothesis were true. We 
would then say that this z score differed significantly from what would be expected under the hypothesis, 
and we would then be inclined to reject the hypothesis. 

The total shaded area 0.05 is the significance level of the test. It represents the probability of our 
being wrong in rejecting the hypothesis (i.e., the probability of making a Type I error). Thus we say that 
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z=-1.96 z=1.96 



Fig. 10-1 Standard normal curve with the critical region (0.05) and acceptance region (0.95). 

the hypothesis is rejected at 0.05 the significance level or that the z score of the given sample statistic is 
significant at the 0.05 level. 

The set of z scores outside the range —1.96 to 1.96 constitutes what is called the critical region of the 
hypothesis, the region of rejection of the hypothesis, or the region of significance. The set of z scores inside 
the range —1.96 to 1.96 is thus called the region of acceptance of the hypothesis, or the region of non- 
significance. 

On the basis of the above remarks, we can formulate the following decision rule (or test of hypoth- 
esis or significance): 

Reject the hypothesis at the 0.05 significance level if the z score of the statistic S lies outside the range 
-1.96 to 1.96 (i.e., either z > 1.96 or z < -1.96). This is equivalent to saying that the observed sample 
statistic is significant at the 0.05 level. 

Accept the hypothesis otherwise (or, if desired, make no decision at all). 

Because the z score plays such an important part in tests of hypotheses, it is also called a test statistic. 

It should be noted that other significance levels could have been used. For example, if the 0.01 level 
were used, we would replace 1.96 everywhere above with 2.58 (see Table 10.1). Table 9.1 can also be 
used, since the sum of the significance and confidence levels is 100%. 



Table 10.1 



Level of significance, a 


0.10 


0.05 


0.01 


0.005 


0.002 


Critical values of z for 


-1.28 
or 1.28 


-1.645 
or 1.645 


-2.33 
or 2.33 


-2.58 
or 2.58 


-2.88 
or 2.88 


Critical values of z for 
two-tailed tests 


-1.645 
and 1.645 


-1.96 
and 1.96 


-2.58 
and 2.58 


-2.81 
and 2.81 


-3.08 
and 3.08 



TWO-TAILED AND ONE-TAILED TESTS 

In the above test we were interested in extreme values of the statistic S or its corresponding z score 
on both sides of the mean (i.e., in both tails of the distribution). Such tests are thus called two-sided tests. 
or two-tailed tests. 

Often, however, we may be interested only in extreme values to one side of the mean (i.e., in one tail 
of the distribution), such as when we are testing the hypothesis that one process is better than another 
(which is different from testing whether one process is better or worse than the other). Such tests are 



248 



STATISTICAL DECISION THEORY 



[CHAP. 10 



called one-sided tests, or one-tailed tests. In such cases the critical region is a region to one side of the 
distribution, with area equal to the level of significance. 

Table 10.1, which gives critical values of z for both one-tailed and two-tailed tests at various levels of 
significance, will be found useful for reference purposes. Critical values of z for other levels of signifi- 
cance are found from the table of normal-curve areas (Appendix II). 



SPECIAL TESTS 

For large samples, the sampling distributions of many statistics are normal distributions (or at least 
nearly normal), and the above tests can be applied to the corresponding z scores. The following special 
cases, taken from Table 8.1, are just a few of the statistics of practical interest. In each case the results 
hold for infinite populations or for sampling with replacement. For sampling without replacement from 
finite populations, the results must be modified. See page 182. 

1. Means. Here S = X, the sample mean; fi s — \i% — fi, the population mean; and a s — — cr/^/N, 
where a is the population standard deviation and N is the sample size. The z score is given by 

_ X-fi 

When necessary, the sample deviation s or s is used to estimate a. 

2. Proportions. Here S = P, the proportion of "successes" in a sample; /j, s = /j, P = p, where p is the 
population proportion of successes and N is the sample size; and a s = o> = ^JpqjN, where 
q=l-p. 

The z score is given by 




In case P = X/N, where X is the actual number of successes in a sample, the z score becomes 



X-Np 




That is, fi x = M = Np, a x = o = \/Npq, and S = X. 
The results for other statistics can be obtained similarly. 

OPERATING-CHARACTERISTIC CURVES; THE POWER OF A TEST 

We have seen how the Type I error can be limited by choosing the significance level properly. It is 
possible to avoid risking Type II errors altogether simply by not making them, which amounts to never 
accepting hypotheses. In many practical cases, however, this cannot be done. In such cases, use is often 
made of operating-characteristic curves, or OC curves, which are graphs showing the probabilities of 
Type II errors under various hypotheses. These provide indications of how well a given test will enable us 
to minimize Type II errors; that is, they indicate the power of a test to prevent us from making wrong 
decisions. They are useful in designing experiments because they show such things as what sample sizes 
to use. 

p- VALUES FOR HYPOTHESES TESTS 



The /rvalue is the probability of observing a sample statistic as extreme or more extreme than the 
one observed under the assumption that the null hypothesis is true. When testing a hypothesis, state the 
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value of a. Calculate your /rvalue and if the /rvalue <oc, then reject H . Otherwise, do not reject H . 
For testing means, using large samples in > 30), calculate the /rvalue as follows: 

1. For H : n = |a and H x : \i<\i , /rvalue = P(Z< computed test statistic), 

2. For H : |i=u and H x : u>(i , /rvalue = P(Z> computed test statistic), and 

3. For H Q : n=u and H x : u^(i , /rvalue = P(Z<—\ computed test statistic |) + P(Z> | computed 
test statistic |). 

The computed test statistic is f . where x is the mean of the sample, s is the standard deviation 

(s/y/n) 

of the sample, and p, is the value specified for p, in the null hypothesis. Note that if a is unknown, it is 
estimated from the sample by using s. This method of testing hypothesis is equivalent to the method of 
finding a critical value or values and if the computed test statistic falls in the rejection region, reject the 
null hypothesis. The same decision will be reached using either method. 



CONTROL CHARTS 

It is often important in practice to know when a process has changed sufficiently that steps should be 
taken to remedy the situation. Such problems arise, for example, in quality control. Quality control 
supervisors must often decide whether observed changes are due simply to chance fluctuations or are 
due to actual changes in a manufacturing process because of deteriorating machine parts, employees' 
mistakes, etc. Control charts provide a useful and simple method for dealing with such problems 
(see Problem 10.16). 



TESTS INVOLVING SAMPLE DIFFERENCES 
Differences of Means 

Let X x and X 2 be the sample means obtained in large samples of sizes and N 2 drawn from 
respective populations having means p x and p, 2 an d standard deviations o x and er 2 . Consider the null 
hypothesis that there is no difference between the population means (i.e., p x = fi 2 ), which is to say that 
the samples are drawn from two populations having the same mean. 

Placing p x —p, 2 in equation (5) of Chapter 8, we see that the sampling distribution of 
differences in means is approximately normally distributed, with its mean and standard deviation 
given by 



Vxi-xi — 




where we can, if necessary, use the sample standard deviations s x and s 2 (or s x and s 2 ) as estimates of a x 
and a 2 . 

By using the standardized variable, or z score, given by 

_ X x -X 2 -0 = X l -X 2 

°~X\-X1 CX1-X2 

we can test the null hypothesis against alternative hypotheses (or the significance of an observed differ- 
ence) at an appropriate level of significance. 

Differences of Proportions 

Let P x and P 2 be the sample proportions obtained in large samples of sizes N x and N 2 drawn from 
respective populations having proportions p x and p 2 . Consider the null hypothesis that there is no 
difference between the population parameters (i.e., p x = p 2 ) and thus that the samples are really 
drawn from the same population. 
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Placing Pl = p 2 = p in equation (<5) of Chapter 8, we see that the sampling distribution of differences 
in proportions is approximately normally distributed, with its mean and standard deviation given by 

Hpi-n = and a P1 _p2 = ^p q Q- + ±?j (J) 

N l P l +N 2 P 2 



where 



N\ + N 2 

is used as an estimate of the population proportion and where q = 1 — p. 
By using the standardized variable 

z = p 1 -p 2 -o = p 1 -p 2 {4) 

a P\-P2 a P\-Pl 

we can test observed differences at an appropriate level of significance and thereby test the null 
hypothesis. 

Tests involving other statistics can be designed similarly. 



TESTS INVOLVING BINOMIAL DISTRIBUTIONS 

Tests involving binomial distributions (as well as other distributions) can be designed in a manner 
analogous to those using normal distributions; the basic principles are essentially the same. See Problems 
10.23 to 10.28. 



Solved Problems 



TESTS OF MEANS AND PROPORTIONS, USING NORMAL DISTRIBUTIONS 

10.1 Find the probability of getting between 40 and 60 heads inclusive in 100 tosses of a fair coin. 
SOLUTION 

According to the binomial distribution, the required probability is 

Since Np = 1000 and Nq = 1000 are both greater than 5, the normal approximation to the binomial 
distribution can be used in evaluating this sum. The mean and standard deviation of the number of heads in 
100 tosses are given by 

H = Np = 1000 = 50 and o = s/Npq = ^/(1OO)00 = 5 

On a continuous scale, between 40 and 60 heads inclusive is the same as between 39.5 and 60.5 heads. 
We thus have 

39.5 in standard units = 39 ' 5 g 50 = -2.10 60.5 in standard units = 60 ' 5 5 50 = 2.10 

Required probability = area under normal curve between z = -2.10 and z = 2.10 
= 2(area between z = and z = 2.10) = 2(0.4821) = 0.9642 
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10.2 To test the hypothesis that a coin is fair, adopt the following decision rule: 

Accept the hypothesis if the number of heads in a single sample of 100 tosses is between 40 

and 60 inclusive. 

Reject the hypothesis otherwise. 

(a) Find the probability of rejecting the hypothesis when it is actually correct. 

(b) Graph the decision rule and the result of part (a). 

(c) What conclusions would you draw if the sample of 100 tosses yielded 53 heads? And if it 
yielded 60 heads? 

(d) Could you be wrong in your conclusions about part (c)? Explain. 
SOLUTION 

(a) From Problem 10.1, the probability of not getting between 40 and 60 heads inclusive if the coin 
is fair is 1 — 0.9642 = 0.0358. Thus the probability of rejecting the hypothesis when it is correct 
is 0.0358. 

(b) The decision rule is illustrated in Fig. 10-2, which shows the probability distribution of heads in 100 
tosses of a fair coin. If a single sample of 100 tosses yields a z score between -2.10 and 2.10, we accept 
the hypothesis; otherwise, we reject the hypothesis and decide that the coin is not fair. 

The error made in rejecting the hypothesis when it should be accepted is the Type 1 error of 
the decision rule; and the probability of making this error, equal to 0.0358 from part (a), is 
represented by the total shaded area of the figure. If a single sample of 100 tosses yields a number 
of heads whose z score (or z statistic) lies in the shaded regions, we would say that this z score 
differed significantly from what would be expected if the hypothesis were true. For this reason, 
the total shaded area (i.e., the probability of a Type I error) is called the significance level of the 
decision rule and equals 0.0358 in this case. Thus we speak of rejecting the hypothesis at the 0.0358 
(or 3.58%) significance level. 

(c) According to the decision rule, we would have to accept the hypothesis that the coin is fair 
in both cases. One might argue that if only one more head had been obtained, we would have 
rejected the hypothesis. This is what one must face when any sharp line of division is used in making 
decisions. 

(d) Yes. We could accept the hypothesis when it actually should be rejected — as would be the case, for 
example, when the probability of heads is really 0.7 instead of 0.5. The error made in accepting the 
hypothesis when it should be rejected is the Type II error of the decision. 




Fig. 10-2 Standard normal curve showing the acceptance and rejection regions for testing that 
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10.3 Using the binomial distribution and not the normal approximation to the binomial distribution, 
design a decision rule to test the hypothesis that a coin is fair if a sample of 64 tosses of the coin is 
taken and a significance level of 0.05 is used. Use MINITAB to assist with the solution. 



The binomial plot of probabilities when a fair coin is tossed 64 times is given it 
cumulative probabilities generated by MINITAB are shown below the Fig. 10-3. 




Fig. 10-3 MINITAB plot of the binomial distribution for n = 64 and p = 0.5. 





Probability 
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21 


. 0022285 


0. 0040735 
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.0000000 


22 


. 0043556 


0. 0084291 


10 


.0000000 


.0000000 


23 


. 0079538 


0. 0163829 


11 


.0000000 


.0000001 


24 


. 0135877 


0.0299706 


12 


. 0000002 


.0000002 


25 
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We see that P(X< 23) = 0.01638. Because the distribution is symmetrical, we also know 
P(X>41) = 0.01638. The rejection region {X<23 and X>A\) has probability 2(0.01638) = 0.03276. 
The rejection region {X< 24 and X> 40} would exceed 0.05. When the binomial distribution is used we cannot 
have a rejection region equal to exactly 0.05. The closest we can get to 0.05 without exceeding it is 0.03276. 

Summarizing, the coin will be flipped 64 times. It will be declared unfair or not balanced if 23 or fewer 
heads or 41 or more heads are obtained. The chance of making a Type 1 error is 0.03276 which is as close to 
0.05 as you can get without exceeding it. 



10.4 Refer to Problem 10.3. Using the binomial distribution and not the normal approximation to the 
binomial distribution, design a decision rule to test the hypothesis that a coin is fair if a sample 
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of 64 tosses of the coin is taken and a significance level of 0.05 is used. Use EXCEL to assist with 
the solution. 

SOLUTION 

The outcomes through 64 are entered into column A of the EXCEL worksheet. The expressions 
= BINOMDIST(Al, 64, 0.5,0) and =BINOMDIST(A1,64,0.5, 1) are used to obtain the binomial and 
cumulative binomial distributions. The for the fourth parameter requests individual probabilities and 
the 1 requests cumulative probabilities. A click-and-drag in column B gives the individual probabilities. 



in colurr 


n C a click-and-drag gives the 


cumulative probabilities. 






A 




B 




C 




A 


B 




c 


X 


Probabiln 


ty 






Probability 







5 


42101E- 


20 


5.42101E 


20 


13 


7 . 12151E-07 


9 


40481E-07 


1 


3 


46945E 


18 


3 . 52366E 


18 


14 


2 . 59426E-06 


3 


53474E-06 


2 


1 


09288E 


16 


1.12811E 


16 


15 


8.64754E-06 


1 


21823E-05 


3 


2 


25861E 


15 


2.37142E 


15 


16 


2 . 64831E-05 


3 


86654E-05 


4 


3 


44438E 


14 


3 . 68152E 


14 


17 


7.47758E-05 





000113441 


5 


4 
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13 


4.50141E 


13 


18 
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6 
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12 


4.51451E 


12 


19 


.000472706 
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7 


3 


36762E 


11 


3 . 81907E 


11 


20 


.001063587 





001844982 


8 


2 


39943E 


10 


2.78134E 


10 


21 


.002228469 
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9 


1 
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09 
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09 


22 
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10 
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09 
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09 
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11 


4 
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08 
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08 


24 
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12 


1 


78038E- 


07 


2 .28331E 


07 


25 


. 021740344 





051710939 



It is found, as in Problem 10.3, that P{X< 23) = 0.01638 and because of symmetry, P(X> 41) = 0.01638 
and that the rejection region is { X<23 or X>41} and the significance level is 0.01638 + 0.01638 
or 0.03276. 




Fig. 10-4 Determining the Z value that will give a critical region equal to 0.05. 



10.5 In an experiment on extrasensory perception (ESP), an individual (subject) in one room is asked 
to state the color (red or blue) of a card chosen from a deck of 50 well-shuffled cards by an 
individual in another room. It is unknown to the subject how many red or blue cards are in the 
deck. If the subject identifies 32 cards correctly, determine whether the results are significant at the 
(a) 0.05 and (b) 0.01 levels. 
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SOLUTION 

If p is the probability of the subject choosing the color of a card correctly, then we have to decide 
between two hypotheses: 

H : p = 0.5, and the subject is simply guessing (i.e., the results are due to chance). 
H] : p > 0.5, and the subject has powers of ESP. 

Since we are not interested in the subject's ability to obtain extremely low scores, but only in the ability 
to obtain high scores, we choose a one-tailed test. If hypothesis H is true, then the mean and standard 
deviation of the number of cards identified correctly are given by 

fi = Np = 50(0.5) = 25 and a = s/Npq = v'50(0.5)(0.5) = VTX5 = 3.54 

(a) For a one-tailed test at the 0.05 significance level, we must choose z in Fig. 10-4 so that the shaded area 
in the critical region of high scores is 0.05. The area between and z is 0.4500, and z = 1.645; this can 
also be read from Table 10.1. Thus our decision rule (or test of significance) is: 

If the z score observed is greater than 1.645, the results are significant at the 0.05 level and the 
individual has powers of ESP. 

If the z score is less than 1.645, the results are due to chance (i.e., not significant at the 0.05 level). 

Since 32 in standard units is (32 - 25)/3.54 = 1.98, which is greater than 1.645, we conclude at the 
0.05 level that the individual has powers of ESP. 

Note that we should really apply a continuity correction, since 32 on a continuous scale is between 
31.5 and 32.5. However, 31.5 has a standard score of (31.5 - 25)/3.54 = 1.84, and so the same con- 
clusion is reached. 

(b) If the significance level is 0.01, then the area between and z is 0.4900, from which we conclude that 
z = 2.33. 

Since 32 (or 31.5) in standard units is 1.98 (or 1.84), which is less than 2.33, we conclude that the results 
are not significant at the 0.01 level. 

Some statisticians adopt the terminology that results significant at the 0.01 level are highly significant. 
that results significant at the 0.05 level but not at the 0.01 level are probably significant, and that results 
significant at levels larger than 0.05 are not significant. According to this terminology, we would conclude 
that the above experimental results are probably significant, so that further investigations of the phenomena 
are probably warranted. 

Since significance levels serve as guides in making decisions, some statisticians quote the actual prob- 
abilities involved. For instance, since pr{z > 1.84} = 0.0322, in this problem, the statistician could say that 
on the basis of the experiment the chances of being wrong in concluding that the individual has powers of 
ESP are about 3 in 100. The quoted probability (0.0322 in this case) is called the p-value for the test. 



10.6 The claim is made that 40% of tax filers use computer software to file their taxes. In a sample of 
50, 14 used computer software to file their taxes. Test H Q : p = 0A versus H a : p<0A at a = 0.05 
where p is the population proportion who use computer software to file their taxes. Test using the 
binomial distribution and test using the normal approximation to the binomial distribution. 

SOLUTION 

If the exact test of H n :p = 0A versus ff a : p<0.4 at a = 0.05 is used, the null is rejected if X< 15. This is 
called the rejection region. If the test based on the normal approximation to the binomial is used, the null is 
rejected if Z< -1.645 and this is called the rejection region. X= 14 is called the test statistic. The binomial 
test statistic is in the rejection region and the null is rejected. Using the normal approximation, the test 
statistic is z = = —1.73. The actual value of a is 0.054 and the rejection region is X< 15 and the 
cumulative binomial probability P(X< 15) is used. If the normal approximation is used, you would also 
reject since z = -1.73 is in the rejection region which is Z<- 1.645. Note that if the binomial distribution is 
used to perform the test, the test statistic has a binomial distribution. If the normal distribution is used to test 
the hypothesis, the test statistic, Z, has a standard normal distribution. 
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10.7 The ^-value for a test of hypothesis is defined to be the smallest level of significance at which the 
null hypothesis is rejected. This problem illustrates the computation of the p-value for a statistical 
test. Use the data in Problem 9.6 to test the null hypothesis that the mean height of all the trees on 
the farm equals 5 feet (ft) versus the alternative hypothesis that the mean height is less than 5 ft. 
Find the p-value for this test. 

SOLUTION 

The computed value for z is z = (59.22 - 60)/ 1.01 = -0.77. The smallest level of significance at which 
the null hypothesis would be rejected is />-value = P(z < -0.77) = 0.5 - 0.2794 = 0.2206. The null hypoth- 
esis is rejected if the ^-value is less than the pre-set level of significance. In this problem, if the level of 
significance is pre-set at 0.05, then the null hypothesis is not rejected. The MINITAB solution is as follows 
where the subcommand Alternative-1 indicates a lower-tail test. 

MTB > ZTest mean = 60 sd = 10. Ill data in cl ; 
SUBO Alternative -1. 

Z-Test 

Test of mu = 60 . 00 vs mu < 60 . 00 
The assumed sigma = 10 . 1 

Variable N Mean StDev SE Mean Z P 

height 100 59.22 10.11 1.01 -0.77 0.22 

10.8 A random sample of 33 individuals who listen to talk radio was selected and the hours per week 
that each listens to talk radio was determined. The data are as follows. 

987486887 10 8 10 67789 

658568785 58 76645 

Test the null hypothesis that \i = 5 hours (h) versus the alternative hypothesis that [i ^ 5 at level 
of significance a = 0.05 in the following three equivalent ways: 

(a) Compute the value of the test statistic and compare it with the critical value for a = 0.05. 

(b) Compute the /rvalue corresponding to the computed test statistic and compare the /rvalue 
with a = 0.05. 

(c) Compute the 1 - a = 0.95 confidence interval for fi and determine whether 5 falls in this 
interval. 



256 



STATISTICAL DECISION THEORY 



[CHAP. 10 



SOLUTION 

In the following MINITAB output, the standard deviation is found first, and then specified in the 
Ztest statement and the Zinterval statement. 

MTB > standard deviation cl 
Standard deviation of hours = 1.6005 

MTB > ZTest 5.0 1.6005 'hours' ; 
SUBO Alternative 0. 



Test of mu = 5.000 vs mu not = 5.000 
The assumed sigma =1.60 

Variable N Mean StDev SE Mean Z P 

hours 33 6.897 1.600 0.279 6.81 0.0000 

MTB > Zinterval 95.0 1.6005 'hours' . 

Variable N Mean StDev SE Mean 95.0 % CI 

hours 33 6.897 1.600 0.279 ( 6.351, 7.443) 

(a) The computed value of the test statistic is Z = ^ ^279 ^ = 6-81, the critical values are ±1.96, and the 
null hypothesis is rejected. Note that this is the computed value shown in the MINITAB output. 

(b) The computed ^-value from the MINITAB output is 0.0000 and since the p-value < a = 0.05, the null 
hypothesis is rejected. 

(c) Since the value specified by the null hypothesis, 5, is not contained in the 95% confidence interval for p., 
the null hypothesis is rejected. 

These three procedures for testing a null hypothesis against a two-tailed alternative are equivalent. 



10.9 The breaking strengths of cables produced by a manufacturer have a mean of 1 800 pounds (lb) 
and a standard deviation of 1001b. By a new technique in the manufacturing process, it is claimed 
that the breaking strength can be increased. To test this claim, a sample of 50 cables is tested and 
it is found that the mean breaking strength is 18501b. Can we support the claim at the 0.01 
significance level? 

SOLUTION 

We have to decide between the two hypotheses: 

H : fi = 18001b, and there is really no change in breaking strength. 
Hi : fi> 18001b, and there is a change in breaking strength. 

A one-tailed test should be used here; the diagram associated with this test is identical with Fig. 10-4 of 
Problem 10.5(a). At the 0.01 significance level, the decision rule is: 

If the z score observed is greater than 2.33, the results are significant at the 0.01 level and H is rejected. 
Otherwise, H is accepted (or the decision is withheld). 
Under the hypothesis that H is true, we find that 

z _X-n _ 1850 - 1800 _ 3 55 
Z ~a/VN~ lOO/x/50 



which is greater than 2.33. Hence we conclude that the results are highly significant and that the claim should 
thus be supported. 
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p- VALUES FOR HYPOTHESES TESTS 



10.10 A group of 50 Internet shoppers were asked how much they spent per year on the Internet. Their 
responses are shown in Table 10.2. It is desired to test that they spend $325 per year versus it is 
different from $325. Find the /rvalue for the test of hypothesis. What is your conclusion for 
a = 0.05? 



Table 10.2 



The mean of the data is 304.60, the standard deviation is 101.51, the computed test statistic is 

z = •^'^ — = —1.43. The Z statistic has an approximate standard normal distribution. The computed 
101.50/v^0 

p- value is the following P(Z < - | computed test statistic | ) or Z > | computed test statistic | ) or P(Z < - 1 .43) 
+ P(Z>1.43). The answer may be found using Appendix II, or using EXCEL. Using EXCEL, the p-value 
=2* NORMSDIST(- 1 .43) = 0. 1 527, since the normal curve is symmetrical and the areas to the left of - 1 .43 
and to the right of 1 .43 are the same, we may simply double the area to the left of — 1 .43. Because the p-value 
is not less than 0.05, we do not reject the null hypothesis. Refer to Fig. 10-6 to see the graphically the 
computed p-value for this problem. 




Z=-1.43 Z=1.43 



Fig. 10-6 The p- value is the sum of the area to the left of Z = - 1 .43 and the area to the right of 
Z=1.43. 



10.11 Refer to Problem 10. 10. Use the statistical software MINITAB to analyze the data. Note that the 
software gives the /rvalue and you, the user, are left to make a decision about the hypothesis 
based on whatever value you may have assigned to a. 
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SOLUTION 

The pull-down "Stat => Basic statistics => 1 sample Z" gives the following analysis. The test statistic and 
the p-value are computed for you. 

One-Sample Z: Amount 

Test of mu = 325 vs not = 325 

The assumed standard deviation = 101 . 51 

Variable N Mean StDev SE Mean Z P 

Amount 50 304.460 101.508 14.356 -1.43 0.152 

Note that the software gives the value of the test statistic (-1.43) and the /7-value (0.152). 



10.12 A survey of individuals who use computer software to file their tax returns is shown in Table 10.3. 
The recorded response is the time spent in filing their tax returns. The null hypothesis is 
H Q : u. = 8.5 hours versus the alternative hypothesis H l : ^<8.5. Find the /rvalue for the test 
of hypothesis. What is your conclusion for a = 0.05? 



The mean of the data in Table 10.3 is 7.42 h, the standard deviation is 2.91 h and the computed test 

statistic is Z = — = -2.62. The Z statistic has an approximate standard normal distribution. The 
2.91/v^0 

MINITAB pull-down menu "Calc => Probability distribution Normal" gives the dialog box shown in 
Fig. 10-7. The dialog box is filled in as shown. 

The output given by the dialog box in Fig. 10-7 is as follows. 



Cumulative Distribution Function 

Normal with mean = and standard deviation = 1 

x P( X < = x ) 

-2.62 0.0043965 

The /7-value is 0.0044 and since the /rvalue < a, the null hypothesis is rejected. Refer to Fig. 10-8 to 
see graphically the computed value of the /rvalue for this problem. 
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C Probability density 

<• Cumulative probability 

r Inverse cumulative probability 






Mean: u o 

Standard deviation: 10 






C Input column: 

Optional storage: | 
P Input constant: |-2 6 2| 




Optional storage: [ 






Help | 


OK Cancel | 







Fig. 10-7 Dialog box for figuring the p-value when the test statistic equals —2.62. 




Z=-2.62 

Fig. 10-8 The p-value is the area to the left of Z= -2.62. 



10.13 Refer to Problem 10.12. Use the statistical software SAS to analyze the data. Note that the 
software gives the /7-value and you, the user, are left to make a decision about the hypothesis 
based on whatever value you may have assigned to a. 

SOLUTION 

The SAS output is shown below. The ^-value is shown as Prob > z = 0.0044, the same value as was 
obtained in Problem 10.12. It is the area under the standard normal curve to the left of -2.62. Compare the 
oilier quantities in the SAS output with those in Problem 10.12. 

SAS OUTPUT: 

One Sample Z Test for a Mean 
Sample Statistics for time 

N Mean Std. Dev. Std. Error 



50 



7.42 



2.91 



0.41 
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Hypothesis Test 

Mean of time => 8 
Mean of time < £ 
'ith a specified known standard deviatii 
Z Statistic Prob > 



Null hypothesis: 



-2.619 0.0044 
95% Confidence Interval for the Mean 
(Upper Bound Only) 

Lower Limit Upper Limit 



-infinity 8.10 
Note that the 95% one-sided interval (-oo, 
ign that the null hypothesis should be rejected a 



.10) does not contain the null value, 8.5. This is another 
the a = 0.05 level. 



10.14 It is claimed that the average time spent listening to MP3's by those who listen to these devices is 
5.5 h per week versus the average time is greater than 5.5. Table 10.4 gives the time spent listening 
to an MP3 player for 50 individuals. Test H : u = 5.5 h versus the alternative H l : [l > 5.5 h. 
Find the /rvalue for the test of hypothesis using the computer software STAT1STIX. What is 
your conclusion for a = 0.05? 



SOLUTION 

The STATISTIX package gives 
Statistix 8 . 
Descriptive Statistics 
Variable N Mean 

MP3 50 5.9700 



The computed test statistic is Z = 



1.0158/^/50 



= 3.27. The p-value is computed in Fig. 10-9. 



Figure 10-10 shows graphically the computed value of the p- value for this problem. 

The computed p-value is 0.00054 and since this is less than 0.05 the null hypothesis is rejected. 
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Function 

C Beta(x,a,b) 

C Binomial (x, n, p) 

f Chi-square (x.. d() 

'" ' Correlation (x, n) 

C F (K, dfnum, dlden) 

C F Inverse (p. dr'num. dfden'l 

C Hj>pergeo(x1,x2,n1.n2) 

C Meg-Binomial (n+x, rt p] 

C Poisson (x, lambda) 

r T 1-Tai(itdf] 

r T 2-Tail(x,df] 

T Inverse (p, df) 
(• Z1 -Tail fx) 
T Z 2-Tm! (k] 

Z Inverse (p) 



1: * 1 


Close 1 




Print All | 


Help 



1 3. 27 




Fig. 10-10 The ^-value is the area to the right of Z= 3.27 



10.15 Using SPSS and the pull-down "Analyze =>• Compare means =>• one-sample t test" and the data in 
Problem 10.14, test H : fx = 5.5 h versus the alternative H : [i> 5.5 h at a = 0.05 by finding the 
p-va\ue and comparing it with a. 



SOLUTION 

The SPSS output is as follows: 



One-Sample Statistics 

Mean Std. Deviation 



Std. Error Mean 
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One-Sample Test 





Test Value=5.5 




df 


Sig. (2-tailed) 


Mean 
Difference 


95% Confidence 
Interval of the 
Diffrence 
Lower Upper 


MPE 


3.272 


49 


0.002 


.47000 


.1813 .7587 



In the first portion of the SPSS output, the needed statistics are given. Note that the computed test 
statistic is referred to as t, not z. This is because for w>30, the t and the z distribution are very 
similar. The t distribution has a parameter called (he degrees of freedom that is equal to n—\. 
The /j-value computed by SPSS is always a two-tailed Rvalue and is referred to as Sig.(2-tailed). It is 
equal to 0.002. The 1-tailed value is 0.002/2 = 0.001. This is close to the value found in Problem 10.14 
which equals 0.00054. When using computer software, the user must become aware of the idiosyncrasies 
of the software. 



CONTROL CHARTS 



10.16 A control chart is used to control the amount of mustard put into containers. The mean amount 
of fill is 496 grams (g) and the standard deviation is 5 g. To determine if the machine filling the 
mustard containers is in proper working order, a sample of 5 is taken every hour for all eight h in 
the day. The data for two days is given in Table 10.5. 

(a) Design a decision rule whereby one can be fairly certain that the mean fill is being main- 
tained at 496 g with a standard deviation equal to 5 g for the two days. 

(b) Show how to graph the decision rule in part (a). 



492.2 
487.9 
493.8 



486.2 
489.5 
495.9 
494.1 
494.0 



493.6 
503.2 
486.0 



497.8 
493.4 
495.8 
508.0 



5 

503.4 
493.4 
493.9 
493.8 
501.3 



494.9 
492.3 
502.9 
502.8 
498.9 



7 

497.5 
497.0 
493.8 
497.1 
488.3 



490.5 
503.0 
496.4 
489.7 
492.6 



10 



16 



492.2 
487.9 
493.8 
495.4 
491.7 



486.2 
489.5 
495.9 
494.1 
494.0 



493.6 
503.2 
486.0 



508.6 
497.8 



493.9 
493.8 
501.3 



494.9 
492.3 
502.9 



497.5 
497.0 
493.8 
497.1 



490.5 
503.0 
496.4 
489.7 
492.6 



STATISTICAL DECISION THEORY 



SOLUTION 

(a) With 99.73% confidence we can say that the sample mean x must lie in the range u Y - 3cr v to u Y + 3ct y 

or n - 3—= to fj, + 3 -p. Since u = 496, a = 5, and n = 5, it follows that with 99.73% confidence, the 

V" V" 5 5 

sample mean should fall in the interval 496 - 3 —= to 496 + 3 —= or between 489.29 and 502.7 1 . Hence 

our decision rule is as follows: 

If a sample mean falls inside the range 489.29 g to 502.71 g, assume the machine fill is correct. 
Otherwise, conclude that the filling machine is not in proper working order and seek to determine 
the reasons for the incorrect fills. 

(b) A record of the sample means can be kept by the use of a chart such as shown in Fig. 10-11, called 
a quality control chart. Each time a sample mean is computed, it is represented by a particular point. 
As long as the points lie between the lower limit and the upper limit, the process is under control. 
When a point goes outside these control limits, there is a possibility that something is wrong and 
investigation is warranted. 

The 80 observations are entered into column CI. The pull down menu "Stat Control 
Charts => Variable charts for subgroups Xbar" gives a dialog box which when filled in gives the control 
chart shown in Fig. 10-11. 

Xbar Chart of Amount 

504 



502 

500 

S 498 
S 

1 496 
E 

W 494 

492 
490 



2 4 6 8 10 12 14 16 
Sample 

Fig. 10-11 Control chart with 3a limits for controlling the mean fill of mustard containers. 

The control limits specified above are called the 99.73% confidence limits, or briefly, the 3a limits. 
Other confidence limits (such as 99% or 95% limits) could be determined as well. The choice in each case 
depends on the particular circumstances. 

TESTS INVOLVING DIFFERENCES OF MEANS AND PROPORTIONS 

10.17 An examination was given to two classes consisting of 40 and 50 students, respectively. In the first 
class the mean grade was 74 with a standard deviation of 8, while in the second class the mean 
grade was 78 with a standard deviation of 7. Is there a significant difference between the perfor- 
mance of the two classes at the (a) 0.05 and (b) 0.01 levels? 
SOLUTION 

Suppose that the two classes come from two populations having the respective means ^, and 
We thus need to decide between the hypotheses: 

H : hi = fi 2 , and the difference is due merely to chance. 

H\ ■ fJ.\ / (J.2, an d there is a significant difference between the classes. 
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Under hypothesis H , both classes come from the same population. The mean and standard deviation of the 
difference in means are given by 



V40 50 

where we have used the sample standard deviations as estimates of a, and a 2 . Thus 
Xi-X _ 74- 78 _ _ 2 49 

(a) For a two-tailed test, the results are significant at the 0.05 level if z lies outside the range - 1 .96 to 1 .96. 
Hence we conclude that at the 0.05 level there is a significant difference in performance between the 
two classes and that the second class is probably better. 

(b) For a two-tailed test, the results are significant at the 0.01 level if z lies outside the range -2.58 and 2.58. 
Hence we conclude that at the 0.01 level there is no significant difference between the classes. 

Since the results are significant at the 0.05 level but not at the 0.01 level, we conclude that the results are 
probably significant (according to the terminology used at the end of Problem 10.5). 



10.18 The mean height of 50 male students who showed above-average participation in college athletics 
was 68.2 inches (in) with a standard deviation of 2.5 in, while 50 male students who showed no 
interest in such participation had a mean height of 67.5 in with a standard deviation of 2.8 in. 
Test the hypothesis that male students who participate in college athletics are taller than other 
male students. 

SOLUTION 

We must decide between the hypotheses: 

H : /ij = /i 2 , and there is no difference between the mean heights. 

H\ : /ij > H2, and the mean height of the first group is greater than that of the second group. 
Under hypothesis H , 



where we have used the sample standard deviations as estimates of o~\ and a 2 . Thus 

_X l -X 2 

a Xl-X2 

Using a one-tailed test at the 0.05 significance level, we would reject hypothesis H if the z score were greater 
than 1.645. Thus we cannot reject the hypothesis at this level of significance. 

It should be noted, however, that the hypothesis can be rejected at the 0.10 level if we are willing to take 
the risk of being wrong with a probability of 0.10 (i.e., 1 chance in 10). 




10.19 A study was undertaken to compare the mean time spent on cell phones by male and female 
college students per week. Fifty male and 50 female students were selected from Midwestern 
University and the number of hours per week spent talking on their cell phones determined. 
The results in hours are shown in Table 10.6. It is desired to test H : - ^ 2 = versus 
H a : \i { — \i 2 ^ based on these samples. Use EXCEL to find the /rvalue and reach a decision 
about the null hypothesis. 
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Table 10.6 Hours spent talking oi 



e for males and females at Midwestern 



SOLUTION 

The data in Table 10.6 are entered into an EXCEL worksheet as shown in Fig. 10-12. The male dala are 
entered into A2 : El 1 and the female data are entered into F2:J1 1. The variance ol" the male dala is computed 
by entering =VAR(A2:E11) into A14. The variance of female data is computed by entering 
= VAR(F2:J11) into A15. The mean of male data is computed by entering = AVERAGE(A2 : El 1) 



File Ectt View Insert Format look Eata Window Help 



Ma - a 
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Fig. 10-12 EXCEL worksheet for computing the />-value in problem 10-19. 
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into A16. The mean of female data is computed by entering = AVERAGE(F2 : J 1 1 ) into A17. The test 
statistic is =(A16-A17)/SQRT(A14/50+A15/50) and is shown in A19. The test statistic has a standard 
normal distribution and a value of 0.304. The expression = 2*(l-NORMSDIST(A19)) computes the area 
to the right of 0.304 and doubles it. This gives a />-value of 0.761. 

Since the p-value is not smaller than any of the usual a values such as 0.01 or 0.05, the null hypothesis is 
not rejected. The probability of obtaining samples like the one we obtained is 0.761, assuming the null 
hypothesis to be true. Therefore, there is no evidence to suggest that the null hypothesis is false and that it 
should be rejected. 

10.20 Two groups, A and B, consist of 100 people each who have a disease. A serum is given to group A 
but not to group B (which is called the control); otherwise, the two groups are treated identically. 
It is found that in groups A and B, 75 and 65 people, respectively, recover from the disease. 
At significance levels of (a) 0.01, (b) 0.05, and (c) 0.10, test the hypothesis that the serum 
helps cure the disease. Compute the /?-value and show that /rvalue > 0.01, /?-value>0.05, but 
/?-value<0.10. 
SOLUTION 

Let pi and p 2 denote the population proportions cured by (1) using the serum and (2) not using the 
serum, respectively. We must decide between two hypotheses: 

Ho ■ P\ = Pi, and the observed differences are due to chance (i.e. the serum is ineffective). 
H\ ■ P\ > P2, and the serum is effective. 
Under hypothesis H , 



= and = V M {-k + £) = v (0,70) (0 " 30) (t^ + t^) = °- 0648 

where we have used as an estimate of p the average proportion of cures in the two sample groups given by 
(75 + 65)/200 = 0.70, and where q=\-p = 0.30. Thus 

_ Pi -P 2 _ 0.750 -0.650 _ 
2 ~ °p\-P2 ~ 01)648 " 

(a) Using a one-tailed test at the 0.01 significance level, we would reject hypothesis H only if the z score 
were greater than 2.33. Since the z score is only 1.54, we must conclude that the results are due to 
chance at this level of significance. 

(b) Using a one-tailed test at the 0.05 significance level, we would reject H only if the z score 
were greater than 1.645. Hence we must conclude that the results are due to chance at this 
level also. 

(c) If a one-tailed test at the 0.10 significance level were used, we would reject H only if the z score were 
greater than 1.28. Since this condition is satisfied, we conclude that the serum is effective at the 
0.10 level. 

(d) Using EXCEL, the p-value is given by = 1 - NORMSDIST(1.54) which equals 0.06178. This is the 
area to the right of 1.54. Note that the p-value is greater than 0.01, 0.05, but is less than 0.10. 
Note that these conclusions depend how much we are willing to risk being wrong. If the results are 

actually due to chance, but we conclude that they are due to the serum (Type I error), we might proceed to 
give the serum to large groups of people — only to find that it is actually ineffective. This is a risk that we are 
not always willing to assume. 

On the other hand, we could conclude that the serum does not help, whereas it actually does help 
(Type II error). Such a conclusion is very dangerous, especially if human lives are at stake. 

10.21 Work Problem 10.20 if each group consists of 300 people and if 225 people in group A 
and 195 people in group B are cured. Find the /rvalue using EXCEL and comment on 
your decision. 
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SOLUTION 

Note that in this case the proportions of people cured in the two groups are 225/300 = 0.750 and 
195/300 = 0.650, respectively, which are the same as in Problem 10.20. Under hypothesis H , 

, Pl _ P2 = and an _ n = ^ pq (±. + ±>) =y / (0.70)(0.30)^ + ^ ) ) = 0.0374 
where (225 + 195)/600 = 0.70 is used as an estimate of p. Thus 
_ Pj -P 2 _ 0.750- 0.650 
Z ~ <rpi-P2 ~ 0.0374 ~ ' 
Since this value of z is greater than 2.33, we can reject the hypothesis at the 0.01 significance level; that is, 
we can conclude that the serum is effective with only a 0.01 probability of being wrong. 

This shows how increasing the sample size can increase the reliability of decisions. In many cases, 
however, it may be impractical to increase sample sizes. In such cases we are forced to make decisions on 
the basis of available information and must therefore contend with greater risks of incorrect decisions. 
/?-value= 1 - NORMSDIST(2.67) = 0.003793. This is less than 0.01. 

10.22 A sample poll of 300 voters from district A and 200 voters from district B showed that 56% and 
48%, respectively, were in favor of a given candidate. At a significance level of 0.05, test the 
hypotheses (a) that there is a difference between the districts and (b) that the candidate is preferred 
in district A (c) calculate the p-vdAxxe for parts (a) and (b). 

SOLUTION 

Let pi and p 2 denote the proportions of all voters from districts A and B, respectively, who are in favor 
of the candidate. Under the hypothesis H : p x = p 2 , we have 

"«-« = ° Md ^ = f^hk) = V /( °- 528)( °- 472) (300 + 20o) =°- 0456 

where we have used as estimates of p and q the values [(0.56)(300) + (0.48)(200)]/500 = 0.528 and 
(1 - 0.528) = 0.472, respectively. Thus 

_ Pj -P 2 _ 0.560- 0.480 _ 
Z ~ a Pl _ P2 ~ 0.0456 ~ ' 

(a) If we wish only to determine whether there is a difference between the districts, we must decide between 
the hypotheses H : p x = p 2 and H x : p x ^ p 2 , which involves a two-tailed test. Using a two-tailed test at 
the 0.05 significance level, we would reject H if z were outside the interval -1.96 to 1.96. Since z = 1.75 
lies inside this interval, we cannot reject H at this level; that is, there is no significant difference between 
the districts. 

(b) If we wish to determine whether the candidate is preferred in district A, we must decide between the 
hypotheses H : p x = p 2 and H x : p x > p 2 , which involves a one-tailed test. Using a one-tailed test at the 
0.05 significance level, we would reject H if z were greater than 1.645. Since this is the case, we can 
reject H at this level and conclude that the candidate is preferred in district A. 

(c) For the two-tailed alternative, the p-value = 2*(l-NORMSDIST(1.75)) = 0.0801. You cannot reject the 
null hypothesis at a = 0.05. For the one-tailed alternative, p- value =1-NORMSDIST(1.75) = 0.04006. 
You can reject the null hypothesis at alpha = 0.05. 



TESTS INVOLVING BINOMIAL DISTRIBUTIONS 

10.23 An instructor gives a short quiz involving 10 true-false questions. To test the hypothesis that 
students are guessing, the instructor adopts the following decision rule: 

If seven or more answers are correct, the student is not guessing. 
If less than seven answers are correct, the student is guessing. 
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Find the following probability of rejecting the hypothesis when it is correct using (a) the binomial 
probability formula and (b) EXCEL. 

SOLUTION 

(a) Let p be the probability that a question is answered correctly. The probability of getting X problems 
out of 10 correct is (^)p x q la ~ x . where q = 1 - p. Then under the hypothesis p = 0.5 (i.e., the student is 
guessing), 

Pr{7 or more correct} = Pr{7 correct} + Pr{8 correct} + Pr{9 correct} + Pr{10 correct} 




Thus the probability of concluding that students are not guessing when in fact they are guessing is 0.1719. 
Note that this is the probability of a Type I error. 

(b) Enter the numbers 7, 8, 9, and 10 into A1:A4 of the EXCEL worksheet. Next 
enter = BINOMDIST(Al, 10,0.5,0). Next perform a click-and-drag from Bl to B4. In B5 
enter = SUM(B1 : B4). The answer appears in B5. 

A B 

7 0.117188 

8 0.043945 

9 0.009766 
10 0.000977 

0.171875 

10.24 In Problem 10.23, find the probability of accepting the hypothesis p = 0.5 when actually p = 0.7. 
Find the answer (a) using the binomial probability formula and (b) using EXCEL. 

SOLUTION 

(a) Under the hypothesis p = 0.7, 

Pr{less than 7 correct} = 1 - Pr{7 or more correct} 

= 1 - [(>.7)W + (>^ 
= 0.3504 

(b) The EXCEL solution is: 

Pr{less than 7 correct when p = 0J} is given by = BINOMDIST(6, 10,0.7,1) which equals 0.350389. 
The 1 in the function BINOMDIST says to accumulate from to 6 the binomial probabilities with 
n=W and p = 0J. 

10.25 In Problem 10.23, find the probability of accepting the hypothesis p = 0.5 when actually 
(a) p = 0.6, (b) p = 0.8, (c) p = 0.9, (d) p = 0.4, (e) p = 0.3, (/) p = 0.2, and (g) p = 0.1. 

SOLUTION 

(a) If/7 = 0.6, 

Required probability = 1 - [Pr{7 correct} + Pr{8 correct} + Pr{9 correct} + Pr{ 10 correct}] 
^-[(^(O.e)^)^ 
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The results for parts (b) through (g) can be found similarly and are shown in Table 10.7, together with 
the values corresponding to p = 0.5 and to p = 0.7. Note that the probability is denoted in Table 10.7 by j3 
(probability of a Type II error); the P entry for p = 0.5 is given by (3 = 1 - 0.1719 = 0.828 (from Problem 
10.23), and the f3 entry for p = 0.7 is from Problem 10.24. 




10.27 The null hypothesis is that a die is fair and the alternative hypothesis is that the die is biased such 
that the face six is more likely to occur than it would be if the die were fair. The hypothesis is 
tested by rolling it 18 times and observing how many times the face six occurs. Find the p-value 
if the face six occurs 7 times in the 18 rolls of the die. 

SOLUTION 

Zero through 18 are entered into A1:A19 in the EXCEL worksheet. = BINOMDIST(Al, 18, 0.16666, 0) 
is entered in Bl and a click-and-drag is performed from Bl to B19 to give the individual binomial 
probabilities. = BINOMDlST(Al, 18,0.16666, 1) is entered into CI and a click-and-drag is performed 
from CI to C19 to give the cumulative binomial probabilities. 



270 



STATISTICAL DECISION THEORY 



[CHAP. 10 



0.037566 
0.135233 



0.18389 
0.102973 

0.04462 
0.015297 
0.004207 
0.000935 
0.000168 
2.45E-05 
2.85E-06 
2.64E-07 



0.037566446 
0.17279916 
0.402683738 
0.647882186 
0.831772194 
0.934745656 
0.979365347 
0. 



0.999804143 



0.999999717 
0.99999998 



15 1E-09 

16 3.76E-11 

17 8.86E-13 

18 9.84E-15 

The p-value is p{x > 7} = 1 - P{X < 6} = 1 - 
but not at a = 0.01. 



0.979 = 0.021. The outcome Z=6 is significa 



10.28 To test that 40% of taxpayers use computer software when figuring their taxes against the 
alternative that the percent is greater than 40%, 300 taxpayers are randomly selected and 
asked if they use computer software. If 131 of 300 use computer software, find the p- value for 
this observation. 



300 



A 



|130 



Fig. 10-14 Binomial dialog box for computing 130 or fewer computer software users in 300 when 
40% of all taxpayers use computer software. 
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SOLUTION 

The null hypothesis is H : p = 0.4 versus the alternative H a : p>0.4. The observed valued of X is 131, 
where X is the number who use computer software. The /j-value = P{X > 131 when p = 0A}. 
The p- value = 1 - P{X < 130 when p = 0.4} Using MINITAB, the pull down "Calc^ Probability 
Distribution => Binomial" gives the dialog box shown in Fig 10-14. 

The dialog box in Fig. 10-14 produces the following output. 

Cumulative Distribution Function 

Binomial with n=300 and p=0 . 4 



x P(X< = x) 

130 0.891693 

The .p-value is 1 - P{X < 130when^ = 0.4} = 1 - 0.8971 = 0.1083. The outcome X= 131 is not 
significant at 0.01, 0.05, or 0.10. 



Supplementary Problems 

TESTS OF MEANS AND PROPORTIONS, USING NORMAL DISTRIBUTIONS 

10.29 An urn contains marbles that are either red or blue. To test the null hypothesis of equal proportions of these 
colors, we agree to sample 64 marbles with replacement, noting the colors drawn, and to adopt the following 
decision rule: 

Accept the null hypothesis if 28 < X< 36, where X= number of red marbles in the 64. 
Reject the null hypothesis if X< 27 or if X> 37. 

(a) Find the probability of rejecting the null hypothesis when it is correct. 

(b) Graph the decision rule and the result obtained in part (a). 

10.30 (a) What decision rule would you adopt in Problem 10.29 if you require that the probability of rejecting the 

hypothesis when it is actually correct be no more than 0.01 (i.e., you want a 0.01 significance level)? 

(b) At what level of confidence would you accept the hypothesis? 

(c) What would the decision rule be if the 0.05 significance level were adopted? 

10.31 Suppose that in Problem 10.29 we wish to test the hypothesis that there is a greater proportion of red than 
blue marbles. 

(a) What would you take as the null hypothesis, and what would be the alternative hypothesis? 

(b) Should you use a one- or a two-tailed test? Why? 

(c) What decision rule should you adopt if the significance level is 0.05? 

(d) What is the decision rule if the significance level is 0.01? 



10.32 A pair of dice is tossed 100 times and it is observed that 7's appear 23 times. Test the hypothesis that the dice 
are fair (i.e., not loaded) at the 0.05 significance level by using (a) a two-tailed test and (b) a one-tailed test. 
Discuss your reasons, if any, for preferring one of these tests over the other. 



10.33 



Work Problem 10.32 if the significance level is 0.01. 
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10.34 A manufacturer claimed that at least 95% of the equipment that she supplied to a factory conformed to 
specifications. An examination of a sample of 200 pieces of equipment revealed that 18 were faulty. Test her 
claim at significance levels of (a) 0.01 and (b) 0.05. 

10.35 The claim is made that Internet shoppers spend on the average $335 per year. It is desired to test that this 
figure is not correct at a = 0.075. Three hundred Internet shoppers are surveyed and it is found that the 
sample mean = $354 and the standard deviation = $125. Find the value of the test statistic, the critical 
values, and give your conclusion. 

10.36 It has been found from experience that the mean breaking strength of a particular brand of thread is 
9.72 ounces (oz) with a standard deviation of 1.40 oz. A recent sample of 36 pieces of this thread showed 
a mean breaking strength of 8.93 oz. Test the null hypothesis H : /x = 9.72 versus the alternative 
H : ^<9.72 by giving the value of the test statistic and the critical value for a = 0.10 and a = 0.025. 
Is the result significant at a = 0.10. Is the result significant at a = 0.025? 

10.37 A study was designed to test the null hypothesis that the average number of e-mails sent weekly by 
employees in a large city equals 25.5 versus the mean is greater than 25.5. Two hundred employees were 
surveyed across the city and it was found that x = 30.3 and s= 10.5. Give the value of the test statistic, 
the critical value for a = 0.03, and your conclusion. 

10.38 For large n (n > 30) and known standard deviation the standard normal distribution is used to perform a test 
concerning the mean of the population from which the sample was selected. The alternative hypothesis 
H. d : fj, < tt is called a lower-tailed alternative and the alternative hypothesis H. A : /i > /z is called a upper- 
t ailed alternative. For an upper-tailed alternative, give the EXCEL expression for the critical value if a = 0.1, 
a = 0.01, and a = 0.001. 

p- VALUES FOR HYPOTHESES TESTS 

10.39 To test a coin for its balance, it is flipped 15 times. The number of heads obtained is 12. Give the p-value 
corresponding to this outcome. Use BINOMDIST of EXCEL to find the />-value. 

10.40 Give the /j-value for the outcome in Problem 10.35. 

10.41 Give the p-value for the outcome in Problem 10.36. 

10.42 Give the /j-value for the outcome in Problem 10.37. 
QUALITY CONTROL CHARTS 

10.43 In the past a certain type of thread produced by a manufacturer has had a mean breaking strength of 8.64 oz 
and a standard deviation of 1.28 oz. To determine whether the product is conforming to standards, a sample 
of 16 pieces of thread is taken every 3 hours and the mean breaking strength is determined. Record the 

(a) 99.73% (or 3a), (b) 99%, and (c) 95% control limits on a quality control chart and explain their 
applications. 

10.44 On average, about 3% of the bolts produced by a company are defective. To maintain this quality of 
performance, a sample of 200 bolts produced is examined every 4 hours. Determine the (a) 99% and 

(b) 95% control limits for the number of defective bolts in each sample. Note that only upper control limits 
are needed in this case. 

TESTS INVOLVING DIFFERENCES OF MEANS AND PROPORTIONS 

10.45 A study compared the mean lifetimes in hours of two types of types of light bulbs. The results of the study 
are shown in Table 10.8. 
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Test H : Mi - a* 2 = ^ 
compute the ^-value and coi 



;rsus H,: ih - lh + 
ipare the p-value with a 



. a = 0.05. Give the valu 
= 0.05. Give your conclus 



of the test statistic and 



A study compared the grade point averages (GPAS) of 50 high school seniors with a TV in their bedroom 
with the GPAS of 50 high school seniors without a TV in their bedrooms. The results are shown in Table 
10.9. The alternative is that the mean GPA is greater for the group with no TV in their bedroom. Give the 
value of the test statistic assuming no difference in mean GPAs. Give the ^-value and your conclusion for 
a = 0.05 and for a = 0.10. 

Table 10.9 



10.47 On an elementary school spelling examination, the mean grade of 32 boys was 72 with a standard deviation 
of 8, while the mean grade of 36 girls was 75 with a standard deviation of 6. The alternative is that the girls 
are better at spelling than the boys. Give the value of the test statistic assuming no difference in boys and 
girls at spelling. Give the p-value and your conclusion for a = 0.05 and for a = 0.10. 

10.48 To test the effects of a new fertilizer on wheat production, a tract of land was divided into 60 squares of equal 
areas, all portions having identical qualities in terms of soil, exposure to sunlight, etc. The new fertilizer was 
applied to 30 squares and the old fertilizer was applied to the remaining squares. The mean number of 
bushels (bu) of wheat harvested per square of land using the new fertilizer was 18.2 bu with a standard 
deviation of 0.63 bu. The corresponding mean and standard deviation for the squares using the old fertilizer 
were 17.8 and 0.54 bu, respectively. Using significance levels of (a) 0.05 and (b) 0.01, test the hypothesis that 
the new fertilizer is better than the old one. 



10.49 Random samples of 200 bolts manufactured by machine A and of 100 bolts manufactured by machine B 
showed 19 and 5 defective bolts, respectively. 

(a) Give the test statistic, the p-value, and your conclusion at a = 0.05 for testing that the two machines show 
different qualities of performance 

(b) Give the test statistic, the ^-value, and your conclusion at a = 0.05 for testing that machine B is perform- 
ing better than machine A. 

10.50 Two urns, A and B, contain equal numbers of marbles, but the proportion of red and white marbles in 
each of the urns is unknown. A sample of 50 marbles from each urn is selected from each urn with 
replacement. There are 32 red in the 50 from urn A and 23 red in the 50 from urn B. 
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(a) Test, at a = 0.05, that the proportion of red is the same versus the proportion is different by giving the 
computed test statistic, the computed />-value and your conclusion. 

(b) Test, at a = 0.05, that A has a greater proportion than B by giving the computed test statistic, the p- value, 
and your conclusion. 

10.51 A coin is tossed 15 times in an attempt to determine whether it is biased so that heads are more likely to 
occur than tails. Let X= the number of heads to occur in 15 tosses. The coin is declared biased in favor of 
heads if X> 11. Use EXCEL to find a. 

10.52 A coin is tossed 20 times to determine if it is unfair. It is declared unfair if X=0, 1, 2, 18, 19, 20 where 
X= the number of tails to occur. Use EXCEL to find a. 

10.53 A coin is tossed 15 times in an attempt to determine whether it is biased so that heads are more likely 
to occur than tails. Let X=the number of heads to occur in 15 tosses. The coin is declared biased in favor of 
heads if X> 11. Use EXCEL find |3 Hp = 0.6. 

10.54 A coin is tossed 20 times to determine if it is unfair. It is declared unfair if X = 0, 1, 2, 18, 19, 20 where 
Z=the number of tails to occur. Use EXCEL to find p" if/> = 0.9. 

10.55 A coin is tossed 15 times in an attempt to determine whether it is biased so that heads are more likely to 
occur than tails. Let X= the number of heads to occur in 15 tosses. The coin is declared biased in favor of 
heads if X> 11. Find the p-value for the outcome X= 10. Compare the p-value with the value of a in this 
problem. 

10.56 A coin is tossed 20 times to determine if it is unfair. It is declared unfair if X=0, 1, 2, 3, 4, 16, 17, 18, 19, 20 
where X=the number of tails to occur. Find the p-value for the outcome X= 17. Compare the p- value with 
the value of a in this problem. 

10.57 A production line manufactures cell phones. Three percent defective is considered acceptable. A sample of 
size 50 is selected from a day's production. If more than 3 are found defective in the sample, it is concluded 
that the defective percent exceeds the 3% figure and the line is stopped until it meets the 3% figure. Use 
EXCEL to determine a? 

10.58 In Problem 10.57, find the probability that a 4% defective line will not be shut down. 

10.59 To determine if it is balanced, a die is rolled 20 times. It is declared to be unbalanced so that the face six 
occurs more often than 1 /6 of the time if more than 5 sixes occur in the 20 rolls. Find the value for a. If the 
die is rolled 20 times and a six occurs 6 times, find the ^-value for this outcome. 




Small Sampling 
Theory 



SMALL SAMPLES 

In previous chapters we often made use of the fact that for samples of size TV > 30, called large 
samples, the sampling distributions of many statistics are approximately normal, the approximation 
becoming better with increasing N. For samples of size TV < 30, called small samples, this approximation 
is not good and becomes worse with decreasing N, so that appropriate modifications must be made. 

A study of sampling distributions of statistics for small samples is called small sampling theory. 
However, a more suitable name would be exact sampling theory, since the results obtained hold for large 
as well as for small samples. In this chapter we study three important distributions: Student's t distribu- 
tion, the chi-square distribution, and the F distribution. 



STUDENT'S t DISTRIBUTION 

Let us define the statistic 

s/Vn 1 ; 

which is analogous to the z statistic given by 

_X-\i 
Z ~ a/y/N' 

If we consider samples of size N drawn from a normal (or approximately normal) population with 
mean p and if for each sample we compute t, using the sample mean X and sample standard deviation s 
or s, the sampling distribution for t can be obtained. This distribution (see Fig. 11-1) is given by 

Y T (2) 
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where Y is a constant depending on N such that the total area under the curve is 1, and where the 
constant v = (N — 1) is called the number of degrees of freedom (v is the Greek letter nii). 

Distribution (2) is called Student 's t distribution after its discoverer, W. S. Gossett, who published 
his works under the pseudonym "Student" during the early part of the twentieth century. 

For large values of v or N (certainly N > 30) the curves (2) closely approximate the standardized 
normal curve 




as shown in Fig. 11-1. 



CONFIDENCE INTERVALS 

As done with normal distributions in Chapter 9, we can define 95%, 99%, or other confidence 
intervals by using the table of the t distribution in Appendix III. In this manner we can estimate within 
specified limits of confidence the population mean \x. 

For example, if — 1 975 and 1 975 are the values of t for which 2.5% of the area lies in each tail of the 
t distribution, then the 95% confidence interval for t is 



s that \i is estimated to lie ii 
X-t 



with 95% confidence (i.e. probability 0.95). Note that t gi5 represents the 97.5 percentile value, while 
t 025 = — ^975 represents the 2.5 percentile value. 

In general, we can represent confidence limits for population means by 

X±t c ^J= (5) 



where the values ±t c , called critical values or confidence coefficients, depend on the level of confidence 
desired and on the sample size. They can be read from Appendix III. 

The sample is assumed to be taken from a normal population. This assumption may be checked out 
using the Komogorov-Smirnov test for normality. 
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A comparison of equation (5) with the confidence limits (X ± z c a/\fN) of Chapter 9, shows that 
for small samples we replace z c (obtained fro m the norm al distribution) with t c (obtained from 
the t distribution) and that we replace a with s/N/(N - \)s = s, which is the sample estimate of a. 
As N increases, both methods tend toward agreement. 



TESTS OF HYPOTHESES AND SIGNIFICANCE 

Tests of hypotheses and significance, or decision rules (as discussed in Chapter 10), are easily 
extended to problems involving small samples, the only difference being that the z score, or z statistic, 
is replaced by a suitable t score, or t statistic. 

1. Means. To test the hypothesis H Q that a normal population has mean fi, we use the t score 
(or t statistic) 



X-fi 



where X is the mean of a sample of size N. This is analogous to using the z score 
_ X-n 
Z ~ a/y/N 



for large 7Y, except that s = y/N/(N - l)s is used in place of a. The difference is that while z is 
normally distributed, t follows Student's distribution. As 7Y increases, these tend toward agreement. 

2. Differences of Means. Suppose that two random samples of sizes N\ and N 2 are drawn from normal 
populations whose standard deviations are equal (o x = tr 2 ). Suppose further that these two samples 
have means given by X x and X 2 and standard deviations given by s\ and s 2 , respectively. To test 
the hypothesis H that the samples come from the same population (i.e., \x x = ^ 2 as we U as ®\ = ^i), 
we use the t score given by 



X x -X 2 . N x s\ + N 2 s\ 

t = — = where a = \ ^ (7) 

ay/l/Ni + \/N 2 V A^i + 7V 2 - 2 

The distribution of t is Student's distribution with v = N x + N 2 — 2 degrees of freedom. The use of 
equation (7) is made plausible on placing a x = a 2 = a in the z score of equation (2) of Chapter 10 and 
then using as an estimate of a the weighted mean 

(Ni - l)sj + (N 2 - l)sj ^ N lS j + N 2 s 2 2 
{N l -\) + {N 2 -\) N y +N 2 -2 

where s\ and s\ are the unbiased estimates of u\ and a\. 



THE CHI-SQUARE DISTRIBUTION 

Let us define the statistic 

2 ^.v 2 {X { - X) 2 + (X 2 - Xf + ■■■ + (X N - Xf 

x = — = 2 y°> 

where x is trie Greek letter chi and x i s rea( l "chi-square." 

If we consider samples of size N drawn from a normal population with standard deviation a, and if 
for each sample we compute x 2 , a sampling distribution for x 2 can be obtained. This distribution, called 
the chi-square distribution, is given by 



Y=Y ( x 2 ) {l/2}{l/ - 2} e-^ 2 = Y oX »- 2 e-W* 



(9) 
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where v = N — 1 is the number of degrees of freedom, and Y is a constant depending on v such that the 
total area under the curve is 1. The chi-square distributions corresponding to various values of v are 
shown in Fig. 11-2. The maximum value of Y occurs at \ = v ~ 2 for v > 2. 




5 10 15 20 

Fig. 11-2 Chi-square distributions with (a) 2, (b) 4. (c) 6, and (d) 10 degrees of freedom. 



CONFIDENCE INTERVALS FOR a 

As done with the normal and t distribution, we can define 95%, 99%, or other confidence limits 
by using the table of the \ distribution in Appendix IV. In this manner we can estimate within 
specified limits of confidence the population standard deviation a in terms of a sample standard 
deviation s. 

For example, if x 2 025 an d ^.975 are the values of \ (called critical values) for which 2.5% of the area 
lies in each tail of the distribution, then the 95% confidence interval is 

7 Ns 2 7 

X7o25 < ~^ < X 975 (10) 

from which we see that a is estimated to lie in the interval 

^<a<^ (11) 

X.975 X.025 

with 95% confidence. Other confidence intervals can be found similarly. The values x.025 an d X.095 
represent, respectively, the 2.5 and 97.5 percentile values. 

Appendix IV gives percentile values corresp ondi ng to the number of degrees of freedom v. For large 
values of v (v > 30), we can use the fact that (\/2x 2 - \l2v - 1) is very nearly normally distributed with 
mean and standard deviation 1; thus normal distribution tables can be used if v > 30. Then if xj, and z p 
are the pth percentiles of the chi-square and normal distributions, respectively, we have 

(12) 

In these cases, agreement is close to the results obtained in Chapters 8 and 9. 
For further applications of the chi-square distribution, see Chapter 12. 



DEGREES OF FREEDOM 

In order to compute a statistic such as (7) or (8), it is necessary to use observations obtained from a 
sample as well as certain population parameters. If these parameters are unknown, they must be esti- 
mated from the sample. 
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The number of degrees of freedom of a statistic, generally denoted by v, is denned as the number N 
of independent observations in the sample (i.e., the sample size) minus the number k of population 
parameters, which must be estimated from sample observations. In symbols, v = N — k. 

In the case of statistic (7), the number of independent observations in the sample is N, from which 
we can compute X and s. However, since we must estimate /x, k = 1 and so v — N — 1 . 

In the case of statistic (8), the number of independent observations in the sample is N, from which 
we can compute s. However, since we must estimate a, k = 1 and so v = N — 1 . 



THE F DISTRIBUTION 

As we have seen, it is important in some applications to know the sampling distribution of the 
difference in means (X l — X 2 ) of two samples. Similarly, we may need the sampling distribution of the 
difference in variances {S\ — S 2 ). It turns out, however, that this distribution is rather complicated. 
Because of this, we consider instead the statistic S\/S 2 , since a large or small ratio would indicate a 
large difference, while a ratio nearly equal to 1 would indicate a small difference. The sampling distribu- 
tion in such a case can be found and is called the F distribution, named after R. A. Fisher. 

More precisely, suppose that we have two samples, 1 and 2, of sizes N x and N 2 , respectively, drawn 
from two normal (or nearly normal) populations having variances a\ and of. Let us define the statistic 

F , = g/gj = N^/jN, - \)a\ 

S\lo\ N 2 Sl/{N 2 -\)a\ [J> 



Then the sampling distribution of F is called Fisher's F distribution, or briefly the F distribution, with 
i^! = TVi — 1 and v 2 — N 2 — 1 degrees of freedom. This distribution is given by 

Y = — r^Tn i 15 ) 



where C is a constant depending on and v 2 such that the total area under the curve is 1 . The curve has 
a shape similar to that shown in Fig. 1 1-3, although this shape can vary considerably for different values 
of v\ and v 2 . 




Fig. 11-3 The solid c 



s the F distribution with 4 and 2 degrees of freedom and the dashed c 
F distribution with 5 and 10 degrees of freedom. 
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Appendixes V and VI give percentile values of F for which the areas in the right-hand tail are 0.05 
and 0.01, denoted by F 95 and F 99 , respectively. Representing the 5% and 1% significance levels, these 
can be used to determine whether or not the variance S\ is significantly larger than sj. In practice, 
the sample with the larger variance is chosen as sample 1 . 

Statistical software has added to the ability to find areas under the Student t distribution, the chi- 
square distribution, and the F distribution. It has also added to our ability to sketch the various 
distributions. We will illustrate this in the solved problems section in this chapter. 



Solved Problems 




11.1 The graph of Student's t distribution with nine degrees of freedom is shown in Fig. 11-4. Use 
Appendix III to find the values of a for which (a) the area to the right of a is 0.05, (b) the total 
shaded area is 0.05, (c) the total unshaded area is 0.99, (d) the shaded area on the left is 0.01, and 
(e) the area to the left of a is 0.90. Find (a) through (<?) using EXCEL. 

SOLUTION 

(a) If the shaded area on the right is 0.05, the area to the left of a is (1 -0.05) = 0.95 and a represents the 95th 
percentile, t 95 . Referring to Appendix III, proceed downward under the column headed v until reaching 
entry 9, and then proceed right to the column headed r_ 95 ; the result, 1.83, is the required value of t. 

(b) If the total shaded area is 0.05, then the shaded area on the right is 0.025 by symmetry. Thus the area to 
the left of a is (1 -0.025) = 0.975 and a represents the 97.5th percentile, t_ 975 . From Appendix III 
we find 2.26 to be the required value of t. 

(c) If the total unshaded area is 0.99, then the total shaded area is (1 - 0.99) = 0.01 and the shaded area to 
the right is 0.01/2 = 0.005. From Appendix III we find that / 995 = 3.25. 

(d) If the shaded area on the left is 0.0 1 , then by symmetry the shaded area on the right is 0.0 1 . From Appendix 
III, / 99 = 2.82. Thus the critical value of / for which the shaded area on the left is 0.01 equals -2.82. 

(<?) If the area to the left of a is 0.90, the a corresponds to the 90th percentile, t 90 , which from Appendix 111 
equals 1.38. 
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Using EXCEL, the expression = TINV(0. 1,9) gives 1.833113. EXCEL requires the sum of the areas 
in the two tails and the degrees of freedom. Similarly, =TINV(0.05,9) gives 2.262157, =TINV(0.01,9) 
gives 3.249836, =TINV(0.02,9) gives 2.821438, and =TINV(0.2,9) gives 1.383029. 



11.2 Find the critical values of t for which the area of the right-hand tail of the t distribution is 0.05 if 
the number of degrees of freedom, v, is equal to (a) 16, (b) 27, and (c) 200. 

SOLUTION 

Using Appendix III, we find in the column headed 1 95 the values (a) 1.75, corresponding to v = 16; 
(b) 1.70, corresponding to v = 27; and (c) 1.645, corresponding to v = 200. (The latter is the value that 
would be obtained by using the normal curve; in Appendix III it corresponds to the entry in the last row 
marked oo, or infinity.) 



11.3 The 95% confidence coefficients (two-tailed) for the normal distribution are given by ±1.96. What 
are the corresponding coefficients for the t distribution if (a) v = 9, (b) v = 20, (c) v = 30, and 
(d) v = 60? 

SOLUTION 

For the 95% confidence coefficients (two-tailed), the total shaded area in Fig. 1 1-4 must be 0.05. Thus the 
shaded area in the right tail is 0.025 and the corresponding critical value of t is t 915 . Then the required confidence 
coefficients are ±/, 975 ; for the given values ofv, these are (a) ±2.26, (b) ±2.09, (c) ±2.04, and (d) ±2.00. 



11.4 A sample of 10 measurements of the diameter of a sphere gave a mean X = 438 centimeters (cm) 
and a standard deviation s = 0.06 cm. Find the (a) 95% and (b) 99% confidence limits for the 
actual diameter. 

SOLUTION 

(a) The 95% confidence limits are given by X ± t, 915 (s/VN- 1). 

Since v = N - 1 = 10 - 1 = 9, we find /. 975 = 2.26 [see also Problem 1 1 .3(a)]. Then, using X = 4.38 
and s = 0.06, the required 95% confidence limits are 4.38 ± 2.26(0.06/x/T0^T) = 4.38 ± 0.0452cm. 
Thus we can be 95% confident that the true mean lies between (438 - 0.045) = 4.335 cm and 
(4.38 + 0.045) = 4.425 cm. 

(b) The 99% confidence limits are given by X ± t M5 (s/VN- 1). 

For v = 9, ?.99 5 = 3.25. Then the 99% confidence limits are 4.38 ± 3.25(0.06/VlO - 1) = 
4.38 ± 0.0650 cm, and the 99% confidence interval is 4.315 to 4.445 cm. 



11.5 The number of days absent from work last year due to job-related cases of carpal tunnel syn- 
drome were recorded for 25 randomly selected workers. The results are given in Table 11.1. When 
the data are used to set a confidence interval on the mean of the population of all job-related cases 
of carpal tunnel syndrome, a basic assumption underlying the procedure is that the number of 
days absent are normally distributed for the population. Use the data to test the normality 
assumption and if you are willing to assume normality, then set a 95% confidence interval on /z. 
Table 11.1 



21 23 33 32 37 



40 37 29 23 29 



24 32 24 46 32 



17 29 26 46 27 
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The normal probability plot from MINITAB (Fig. 1 1-5) indicates that it would be reasonable to assume 
normality since the ^-value exceeds 0.15. This p- value is used to test the null hypothesis that the data were 
selected from a normally distributed population. If the conventional level of significance, 0.05, is used then 
normality of the population distribution would be rejected only if the />-value is less than 0.05. Since the 
;>value associated with the Kolmogorov-Smirnov test for normality is reported as p-value > 0.15, we do not 
reject the assumption of normality. 

The confidence interval is found using MINITAB as follows. The 95% confidence interval for the 
population mean extends from 27.21 to 33.59 days per year. 

MTB > tinterval 95% confidence for data in cl 



Confidence Intervals 



StDev 
7.72 



95.0 % CI 
(27.21, 33.59) 



11.6 In the past, a machine has produced washers having a thickness of 0.050 inch (in). To determine 
whether the machine is in proper working order, a sample of 10 washers is chosen, for which the 
mean thickness is 0.053 in and the standard deviation is 0.003 in. Test the hypothesis that the 
machine is in proper working order, using significance levels of (a) 0.05 and (b) 0.01. 




Fig. 11-5 Normal probability plot and Kolmogorov-Smirnov test of normality. 
SOLUTION 

We wish to decide between the hypotheses: 
H : [i = 0.050, and the machine is in proper working order 
H x : /i / 0.050, and the machine is not in proper working order 
Thus a two-tailed test is required. Under hypothesis H , we have 

X - ii 



= 3.00 



0.053 - 0.050 
' = — VJV - 1 = 0.003 V 
For a two-tailed test at the 0.05 significance level, we adopt the decision rule: 

Accept H if / lies inside the interval — 1 915 to / 975 , which for 10 — 1 =9 degrees of freedor 
interval —2.26 to 2.26. 
Reject H otherwise 
Since t = 3.00, we reject H at the 0.05 level. 
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(b) For a two-tailed test at the 0.01 significance level, we adopt the decision rule: 

Accept H Q if t lies inside the interval —7.995 t° /. 995 , which for 10 - 1 = 9 degrees of freedom is the 
interval -3.25 to 3.25. 
Reject H otherwise 
Since t = 3.00, we accept H at the 0.01 level. 

Because we can reject H at the 0.05 level but not at the 0.01 level, we say that the sample result is 
probably significant (see the terminology at the end of Problem 10.5). It would thus be advisable to check the 
machine or at least to take another sample. 



11.7 A mall manager conducts a test of the null hypothesis that li = $50 versus the alternative hypoth- 
esis that li ^ $50, where pi represents the mean amount spent by all shoppers making purchases at 
the mall. The data shown in Table 1 1.2 give the dollar amount spent for 28 shoppers. The test of 
hypothesis using the Student's t distribution assumes that the data used in the test are selected 
from a normally distributed population. This normality assumption may be checked out using 
anyone of several different tests of normality. MINITAB gives 3 different choices for a normality 
test. Test for normality at the conventional level of significance equal to a = 0.05. If the normality 
assumption is not rejected, then proceed to test the hypothesis that li = $50 versus the alternative 
hypothesis that li + $50 at a = 0.05. 



The Anderson-Darling normality test from MINITAB gives a p-value = 0.922, the Ryan-Joyner nor- 
mality test gives the p- value as greater than 0.10, and the Kolmogorov-Smirnov normality test gives the 
p- value as greater than 0.15. In all 3 cases, the null hypothesis that the data were selected from a normally 
distributed population would not be rejected at the conventional 5% level of significance. Recall that the null 
hypothesis is rejected only if the />-valueis less than the preset level of significance. The MINITAB analysis for 
the test of the mean amount spent per customer is shown below. If the classical method of testing hypothesis is 
used then the null hypothesis is rejected if the computed value of the test statistic exceeds 2.05 in absolute 
value. The critical value, 2.05, is found by using Student's t distribution with 27 degrees of freedom. Since the 
computed value of the test statistic equals 18.50, we would reject the null hypothesis and conclude that the 
mean amount spent exceeds $50. If the p-value approach is used to test the hypothesis, then since the computed 
/>-value = 0.0000 is less than the level of significance (0.05), we also reject the null hypothesis. 



Data Display 

Amount 

68 54 57 62 

65 45 24 52 

57 45 65 60 

40 



72 49 92 74 56 
94 59 76 36 75 
36 64 33 50 66 



MTB > TTest . 'Amount ' 
SUBC > Alternative 0. 
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T-Test of the Mean 

Test of mu = . 00 vs mu not = 0.00 

Variable N Mean StDev SE Mean t p 

Amount 28 58.07 16.61 3.14 18.50 0.0000 



11.8 The intelligence quotients (IQs) of 16 students from one area of a city showed a mean of 107 and a 
standard deviation of 10, while the IQs of 14 students from another area of the city showed a 
mean of 1 12 and a standard deviation of 8. Is there a significant difference between the IQs of the 
two groups at significance levels of (a) 0.01 and (b) 0.05? 



SOLUTION 

If fj, x and ^ 2 denote the population mean IQs of students from the two a 
decide between the hypotheses: 

H : /ii = fj, 2 , and there is essentially no difference between the groups. 
H\ ■ Mi 7^ M2> an d there is a significant difference between the groups. 
Under hypothesis H , 



i, respectively, we have to 



ay/l/Ni + \/N 2 



N lS j + N 2 s 2 2 



/ l6(10) 2 +14(^ 
'"V 16+14-2 ~ y - 44 



9.44^1/16+1/14 



(a) Using a two-tailed test at the 0.01 significance level, we would reject H if t were outside the range —t. 995 
to t .995, which for (JVj + N 2 - 2) = (16 + 14 - 2) = 28 degrees of freedom is the range -2.76 to 2.76. 
Thus we cannot reject H at the 0.01 significance level. 

(b) Using a two-tailed test at the 0.05 significance level, we would reject H if / were outside the range —t 915 
to ? 975 , which for 28 degrees of freedom is the range -2.05 to 2.05. Thus we cannot reject H at the 
0.05 significance levels. 

We conclude that there is no significant difference between the IQs of the two groups. 



11.9 The costs (in thousands of dollars) for tuition, room, and board per year at 15 randomly 
selected public colleges and 10 randomly selected private colleges are shown in Table 11.3. Test 
the null hypothesis that the mean yearly cost at private colleges exceeds the mean yearly 
cost at public colleges by 10 thousand dollars versus the alternative hypothesis that the 
difference is not 10 thousand dollars. Use level of significance 0.05. Test the assumptions of 
normality and equal variances at level of significance 0.05 before performing the test concerning 
the means. 



Table 11.3 



F 


ublic Colleges 




Private Colleges 


4.2 


9.1 




11.6 


13.0 


17.7 


6.1 


7.7 




10.4 


18.8 


17.6 


4.9 


6.5 




5.0 


13.2 


19.8 


8.5 


6.2 




10.4 


14.4 


16.8 


4.6 


10.2 




8.1 


17.7 


16.1 
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The Anderson-Darling normality test from M1N1TAB for the public colleges data is shown in 
Fig. 1 1-6. Since the />-value (0.432) is not less than 0.05, the normality assumption is not rejected. A similar 
test for private colleges indicates that the normality assumption is valid for private colleges. 

Probability plot of public 

Normal 




Public 

Test for equal variances for public, private 



F-Test 
Test Statistic 1 .09 
P-Value 0.920 



Levene's Test 
Test Statistic 0.21 
P-Value 0.651 



95% Bonferroni confidence intervals for StDevs 




Fig. 11-6 Anderson-Darling test of normality and F-test of equal 

The F test shown in the lower part of Fig. 1 1 -6 indicates that equal variances may be assumed. The pull- 
down menu "Stat Basic Statistics => 2-sample t" gives the following output. The output indicates that we 
cannot reject that the cost at private colleges exceeds the cost at public colleges by 10,000 dollars. 

Two-Sample T-Test and CI: Public, Private 

Two-sample T for Public vs Private 

N Mean StDev SE Mean 

Public 15 7.57 2.42 0.62 

Private 10 16.51 2.31 0.73 

Difference = mu (Public) - mu (Private) 
Estimate for difference: -8.9433 
95% CI for difference: (-10.9499, -6.9367) 
T-Test of difference = -10 (vs not =) : T-Valu. 
DF = 23 



= 1.09 P-Value = 0.287 



2 Pooled StDev = 2.3760 
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THE CHI-SQUARE DISTRIBUTION 




11.10 The graph of the chi-square distribution with 5 degrees of freedom is shown in Fig. 11-7. Using 
Appendix IV, find the critical values of x 2 for which (a) the shaded area on the right is 0.05, 
(b) the total shaded area is 0.05, (c) the shaded area on the left is 0.10, and (d) the shaded area on 
the right is 0.01. Find the same answers using EXCEL. 

SOLUTION 

(a) If the shaded area on the right is 0.05, then the area to the left of b is ( 1 - 0.05) = 0.95 and b represents 
the 95th percentile, x.95- Referring to Appendix IV, proceed downward under the column headed v 
until reaching entry 5, and then proceed right to the column headed \-* 95 ; the result, 11.1, is the 
required critical value of \ . 

(b) Since the distribution is not symmetrical, there are many critical values for which the total shaded area 
is 0.05. For example, the right-hand shaded area could be 0.04 while the left-hand shaded area is 0.01. It 
is customary, however, unless otherwise specified, to choose the two areas to be equal. In this case, then, 
each area is 0.025. If the shaded area on the right is 0.025, the area to the left of b is 1 - 0.025 = 0.975 
and b represents the 97.5th percentile, x 2 975> which from Appendix IV is 12.8. Similarly, if the shaded 
area on the left is 0.025, the area to the left of a is 0.025 and a represents the 2.5th percentile, x\i5, 
which equals 0.831. Thus, the critical values are 0.83 and 12.8. 

(c) If the shaded area on the left is 0.10, a represents the 10th percentile, xao, which equals 1.61. 

(d) If the shaded area on the right is 0.01, the area to the left of b is 0.99 and b represents the 99th 
percentile, x%9, which equals 15.1. 

The EXCEL answer to (a) is =CHIINV(0.05,5) or 11.0705. The first parameter in CHIINV is the 
area to the right of the point and the second is the degrees of freedom. The answer to (b) is 
=CHIINV(0.975,5) or 0.8312 and =CHIINV(0.025,5) or 12.8325. The answer to (c) is =CHIINV(0.9,5) 
or 1.6103. The answer to (d) is =CHIINV(0.01,5) or 15.0863. 



11.11 Find the critical values of x for which the area of the right-hand tail of the \ distribution is 0.05, 
if the number of degrees of freedom, v, is equal to (a) 15, (b) 21, and (c) 50. 

SOLUTION 

Using Appendix IV, we find in the column headed x%5 the values (a) 25.0, corresponding to v = 15; 
(b) 32.7, corresponding to v = 21; and (c) 67.5, corresponding to v = 50. 
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11.12 Find the median value of x corresponding to (a) 9, (b) 28, and (c) 40 degrees of freedom. 
SOLUTION 

Using Appendix IV, we find in the column headed x 2 o (since the median is the 50th percentile) the 
values (a) 8.34, corresponding to v — 9; (b) 27.3, corresponding to v = 28; and (c) 39.3, corresponding to 
i/ = 40. 

It is of interest to note that the median values are very nearly equal to the number of degrees of freedom. 
In fact, for v > 10 the median values are equal to (y - 0.7), as can be seen from the table. 



11.13 The standard deviation of the heights of 16 male students chosen at random in a school of 1000 
male students is 2.40 in. Find the (a) 95% and (b) 99% confidence limits of the standard deviation 
for all male students at the school. 

SOLUTION 

(a) The 95% confidence limits are gi ven by sy/N/x.ws and sy^/xms- 

For v = 16 - 1 = 15 degrees of freedom, x^is = 27.5 (or x.975 = 5.24) and x 2 025 = 6.26 (or 
X.025 = 2.50). Then the 95% confidence limits are 2.40\/T6/5.24 and 2.40VT6/2.50 (i.e., 1.83 and 
3.84 in). Thus we can be 95% confident that the population standard deviation lies between 1.83 
and 3.84 in. 

(b) The 99% confidence limits are given by s^/N /x.995 and s/s/N/xnas- 

For ^=16-1 = 15 degrees of freedom, = 32.8 (or x.995 = 5.73) and x 2 005 = 4.60 (or 
X.o<>5 = 2.14). Then the 99% confidence limits are 2.40 v / l6/5.73 and 2.40^/16/2.14 (i.e., 1.68 and 
4.49 in). Thus we can be 99% confident that the population standard deviation lies between 1.68 
and 4.49 in. 



11.14 Find x 2 95 for (a) v = 50 and (b) v = 100 degrees of freedom. 
SOLUTION 

For v greater than 30, we can use the fact that \/2 X 2 — V2v — 1 is very closely normally distributed with 
mean and standard deviation 1. Then if z p is the z-score percentile of the standardized normal distribution, 
we can write, to a high degree of approximation, 

from which xj, = \ (-,, + V2v - l) 2 . 

(a) If v = 50, x 2 95 = \ (z.95 + \/2(50) - l) 2 = \ (1 .64 + v/99) 2 = 67.2, which agrees very well with the value 
of 67.5 given in Appendix IV. 

(b) If v = 100, x 2 95 = \ (z.95 + x/2(100) - l) 2 = j (1.64 + v/199) 2 = 124.0 (actual value = 124.3). 

11.15 The standard deviation of the lifetimes of a sample of 200 electric light bulbs is 100 hours (h). Find 
the (a) 95% and (b) 99% confidence limits for the standard deviation of all such electric light bulbs. 

SOLUTION 

(a) The 95% confidence limits are given by s\/N/x.9is and s^/N/xxm- 

For v = 200 - 1 = 199 degrees of freedom, we find (as in Problem 1 1.14) 

X 2 975 = ^(z.975 + \/2(199)-l) 2 = 1(1.96+ 19.92) 2 = 239 

X 2 025 = 2-(z.025 + \/2(199)- l) 2 = £(-1.96 + 19.92) 2 = 161 

from which x.975 = 15.5 and X 0025 = 12.7. Then the 95% confidence limits are lOOV^OO/lS.S = 91.2h 
and 100\/200/12.7 = 1 1 1.3 h, respectively. Thus we can be 95% confident that the population standard 
deviation will lie between 91.2 and 111.3h. 
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(b) The 99% confidence limits are given by sy/N/x.<m and s^/N/xms- 
For v = 200 - 1 = 199 degrees of freedom, 

X 2 995 = 5(2.995 + V 2 ( m ) ~ !) 2 =1(2.58 + 19.92) 2 = 253 

x'oos = l(z.oo5 + \/2(199)-l) 2 = £(-2.58 + 19.92) 2 = 150 

from which x.995 = 15.9 and x.oos = 12-2. Then the 99% confidence limits are 100\/200/15.9 = 88.9h 
and 100\/200/12.2 = 1 15.9 h, respectively. Thus we can be 99% confident that the population standard 
deviation will lie between 88.9 and 1 1 5.9 h. 



11.16 A manufacturer of axles must maintain a mean diameter of 5.000 cm in the manufacturing 
process. In addition, in order to insure that the wheels fit on the axle properly, it is necessary 
that the standard deviation of the diameters equal 0.005 cm or less. A sample of 20 axles is 
obtained and the diameters are given in Table 1 1 .4. 

Table 11.4 



4.996 



5.002 
5.007 



4.998 



5.003 



5.005 
5.000 



4.999 



4.992 
5.000 



The manufacturer wishes to test the null hypothesis that the population standard deviation is 
0.005 cm versus the alternative hypothesis that the population standard deviation exceeds 
0.005 cm. If the alternative hypothesis is supported, then the manufacturing process must be 
stopped and repairs to the machinery must be made. The test procedure assumes that the axle 
diameters are normally distributed. Test this assumption at the 0.05 level of significance. If you 
are willing to assume normality, then test the hypothesis concerning the population standard 
deviation at the 0.05 level of significance. 
SOLUTION 



The Shapiro-Wilk test of normality is given in Fig. 1 1-8. The large />-value (0.9966) would indicate that 
normality would not be rejected. This probability plot and Shapiro-Wilk analysis was produced by the 
statistical software package STATISTIX. 

We have to decide between the hypothesis: 

H Q :cr = 0.005 cm, and the observed value is due to chance. 
H x :a > 0.005 cm, and the variability is too large. 
The SAS analysis is as follows: 

One Sample Chi-square Test for a Variance 
Sample Statistics for diameter 

N Mean Std. Dev. Variance 

20 5.0006 0.0046 215E-7 



Hypothesis Test 
Null hypothesis: Vari, 
Alternative: Variance 
Chi-square 
16.358 



Df 



19 



.ameter <= 0.000025 
ter > 0.000025 
_ Prob _ 
0.6333 



The large /rvalue (0.6333) would indicate that the null hypothesis would not be rejected. 
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Normality probability plot of diameter 

5.011 - 



1 5.001 

6 

4.996 



4.991 -I 

-2-10 1 2 

Rankits 

Shapiro-Wilk W 0.9890 P(W) 0.9966 20 cases 
Fig. 11-8 Shapiro-Wilk test of normality from STATISTIX. 



11.17 In the past, the standard deviation of weights of certain 40.0-ounce packages filled by a machine 
was 0.25 ounces (oz). A random sample of 20 packages showed a standard deviation of 0.32 oz. Is 
the apparent increase in variability significant at the (a) 0.05 and (b) 0.01 levels? 

SOLUTION 

We have to decide between the hypotheses: 
H : a = 0.25 oz, and the observed result is due to chance. 
H\ : a> 0.25 oz, and the variability has increased. 
The value of x 2 for the sample is 

X 2 = ^ = 2 °( 0J f=32.8 
o- (0.25) 2 

(a) Using a one-tailed test, we would reject H at the 0.05 significance level if the sample value of \ were 
greater than x 2 95, which equals 30.1 for v = 20 — 1 = 19 degrees of freedom. Thus we would reject H 
at the 0.05 significance level. 

(b) Using a one-tailed test, we would reject H at the 0.01 significance level if the sample value of \~ were 
greater than x 2 99, which equals 36.2 for 19 degrees of freedom. Thus we would not reject H at the 0.01 
significance level. 

We conclude that the variability has probably increased. An examination of the machine should 
be made. 



THE F DISTRIBUTION 

11.18 Two samples of sizes 9 and 12 are drawn from two normally distributed populations having 
variances 16 and 25, respectively. If the sample variances are 20 and 8, determine whether the 
first sample has a significantly larger variance than the second sample at significance levels of 
(a) 0.05, (b) 0.01, and (c) Use EXCEL to show that the area to the right of 4.03 is between 0.01 
and 0.05. 
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SOLUTION 

For the two samples, 1 and 2, we have N x =9, N 2 = 12, a\ = 16, a\ = 25, S 2 = 20, and S\ = 8. Thus 

_ g/gj _ N 1 Sj/(N 1 - \)o] = (9)(20)/(9 - 1)(16) = 
S*/a* JV 2 S*/(JV 2 - l)o* (12)(8)/(12-1)(25) " 

The degrees of freedom for the numerator and denominator of F are v x = N x - 1 = 9 - 1 = 8 and 
v 2 = N 2 - 1 = 12 - 1 = 11. Then from Appendix V we find that F 95 = 2.95. Since the calculated 
F = 4.03 is greater than 2.95, we conclude that the variance for sample 1 is significantly larger than 
that for sample 2 at the 0.05 significance level. 

For v x = 8 and v 2 = 1 1, we find from Appendix VI that F m = 4.74. In this case the calculated F = 4.03 
is less than 4.74. Thus we cannot conclude that the sample 1 variance is larger than the sample 2 
variance at the 0.01 significance level. 

The area to the right of 4.03 is given by =FDIST(4.03,8,1 1) or 0.018. 



11.19 Two samples of sizes 8 and 10 are drawn from two normally distributed populations having 
variances 20 and 36, respectively. Find the probability that the variance of the first sample is more 
than twice the variance of the second sample. 

Use EXCEL to find the exact probability that fwith 7 and 9 degrees of freedom exceeds 3.70. 
SOLUTION 

We have N\ = 8, N 2 = 10, a\ = 20, and a 2 2 = 36. Thus 

8^/(7X20) _ Sj 
10S 2 2 /(9)(36) • S 2 2 

The number of degrees of freedom for the numerator and denominator are v x = N x — 1 = 8— 1 = 7 and 
v 2 = N 2 — 1 = 10 — 1 = 9. Now if S 2 is more than twice S 2 , then 

F= 1.85^ >(1.85)(2) = 3.70 

Looking up 3.70 in Appendixes V and VI, we find that the probability is less than 0.05 but greater than 0.01. 
For exact values, we need a more extensive tabulation of the F distribution. 

The EXCEL answer is =FD1ST(3.7,7,9) or 0.036 is the probability that i^with 7 and 9 degrees 
of freedom exceeds 3.70. 



Supplementary Problems 

STUDENT'S t DISTRIBUTION 

11.20 For a Student's distribution with 15 degrees of freedom, find the value of t\ such that (a) the area to the right 
of t x is 0.01, (b) the area to the left of t x is 0.95, (c) the area to the right of t x is 0.10, (d) the combined area to 
the right of t { and to the left of -t x is 0.01, and (e) the area between -/, and t x is 0.95. 



11.21 Find the critical values of t for which the area of the right-hand tail of the t distribution is 0.01 if the number 
of degrees of freedom, v, is equal to (a) 4, (b) 12, (c) 25, (d) 60, and (e) 150 using Appendix III. Give the 
solutions to (a) through (e) using EXCEL. 
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11.22 Find the values of t x for Student's distribution that satisfy each of the following conditions: 

(a) The area between -t x and t x is 0.90 and v = 25. 

(b) The area to the left of -t x is 0.025 and v = 20. 

(c) The combined area to the right of t\ and left of —t\ is 0.01 and v = 5. 

(d) The area to the right of t x is 0.55 and v = 16. 

11.23 If a variable U has a Student's distribution with v = 10, find the constant C such that (a) Pr{ [/ > C} = 0.05, 
(/>) Pr{-C < U < C} = 0.98, (c) Pr{f/ < C} = 0.20, and (rf) Pr{[7 > C} = 0.90. 



11.24 The 99% confidence coefficients (two-tailed) for the normal distribution are given by ±2.58. What are the 
corresponding coefficients for the t distribution if (a) v = 4, (b) v = 12, (c) v = 25, (d) v = 30, and (<?) u = 40? 

11.25 A sample of 12 measurements of the breaking strength of cotton threads gave a mean of 7.38 grams (g) and a 
standard deviation of 1.24 g. Find the (a) 95%, (b) 99% confidence limits for the actual breaking strength, 
and (c) the MINITAB solutions using the summary statistics. 

11.26 Work Problem 1 1.25 by assuming that the methods of large sampling theory are applicable, and compare the 
results obtained. 



11.27 Five measurements of the reaction time of an individual to certain stimuli were recorded as 0.28, 0.30, 0.27, 
0.33, and 0.31 seconds. Find the (a) 95% and (b) 99% confidence limits for the actual reaction time. 

11.28 The mean lifetime of electric light bulbs produced by a company has in the past been 1 120 h with a standard 
deviation of 125 h. A sample of eight electric light bulbs recently chosen from a supply of newly produced 
bulbs showed a mean lifetime of 1070 h. Test the hypothesis that the mean lifetime of the bulbs has not 
changed, using significance levels of (a) 0.05 and (b) 0.01. 

11.29 In Problem 11.28, test the hypothesis 1 120 h against the alternative hypothesis ^< 1 120 h, using 
significance levels of (a) 0.05 and (b) 0.01. 

11.30 The specifications for the production of a certain alloy call for 23.2% copper. A sample of 10 analyses of the 
product showed a mean copper content of 23.5% and a standard deviation of 0.24%. Can we conclude at 
(a) 0.01 and (b) 0.05 significance levels that the product meets the required specifications? 

11.31 In Problem 1 1.30, test the hypothesis that the mean copper content is higher than in the required specifica- 
tions, using significance levels of (a) 0.01 and (b) 0.05. 

11.32 An efficiency expert claims that by introducing a new type of machinery into a production process, he can 
substantially decrease the time required for production. Because of the expense involved in maintenance of 
the machines, management feels that unless the production time can be decreased by at least 8.0%, it cannot 
afford to introduce the process. Six resulting experiments show that the time for production is decreased by 
8.4% with a standard deviation of 0.32%. Using significance levels of (a) 0.01 and (b) 0.05, test the 
hypothesis that the process should be introduced. 

11.33 Using brand A gasoline, the mean number of miles per gallon traveled by five similar automobiles under 
identical conditions was 22.6 with a standard deviation of 0.48. Using brand B, the mean number was 21.4 
with a standard deviation of 0.54. Using a significance level of 0.05, investigate whether brand A is really 
better than brand B in providing more mileage to the gallon. 
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11.34 Two types of chemical solutions, A and B, were tested for their pH (degree of acidity of the solution). 
Analysis of six samples of A showed a mean pH of 7.52 with a standard deviation of 0.024. Analysis of 
five samples of B showed a mean pH of 7.49 with a standard deviation of 0.032. Using the 0.05 significance 
level, determine whether the two types of solutions have different pH values. 

11.35 On an examination in psychology, 12 students in one class had a mean grade of 78 with a standard deviation 
of 6, while 15 students in another class had a mean grade of 74 with a standard deviation of 8. Using a 
significance level of 0.05, determine whether the first group is superior to the second group. 



THE CHI-SQUARE DISTRIBUTION 

11.36 For a chi-square distribution with 12 degrees of freedom, use Appendix IV to find the value of x\ such that 

(a) the area to the right of xl is 0.05, (b) the area to the left of xl is 0.99, (c) the area to the right of xl is 
0.025, and (d) find (a) through (c) using EXCEL. 

11.37 Find the critical values of \ 2 for which the area of the right-hand tail of the x 2 distribution is 0.05 if the 
number of degrees of freedom, v, is equal to (a) 8, (b) 19, (c) 28, and (d) 40. 

11.38 Work Problem 11.37 if the area of the right-hand tail is 0.01. 

11.39 (a) Find \[ and xl such that the area under the \ 2 distribution corresponding to v = 20 between Xi an d xl 

is 0.95, assuming equal areas to the right of xl and left of \\ . 

(b) Show that if the assumption of equal areas in part (a) is not made, the values \\ and xl are not unique. 

11.40 If the variable U is chi-square distributed with v = l, find Xi and xl such that (a) Pr{[7 > xl} = 0-025, 
(b) Pr{[7 < x?} = 0.50, and (c) Pr{x? < U < xl} = 0.90. 

11.41 The standard deviation of the lifetimes of 10 electric light bulbs manufactured by a company is 120 h. Find 
the (a) 95% and (b) 99% confidence limits for the standard deviation of all bulbs manufactured by the 
company. 

11.42 Work Problem 11.41 if 25 electric light bulbs show the same standard deviation of 120 h. 

11.43 Find (a)x\ s and (b) x\s for v= 150 using x]> =\( Z P + V2u - l) 2 and (c) compare the answers when 
EXCEL is used. 



11.44 Find (a) Xms and (b) ;\ : 975 for v = 250 using xj = \i z p + V2u - l) 2 and (c) compare the answers when 
EXCEL is used. 

11.45 Show that for large values of v, a good approximation to x 2 is given by (v + z p \Fh/). where z p is the pth 
percentile of the standard normal distribution. 

11.46 Work Problem 1 1 .39 by using the x 2 distributions if a sample of 100 electric bulbs shows the same standard 
deviation of 120 h. Compare the results with those obtained by the methods of Chapter 9. 

11.47 What is the 95% confidence interval of Problem 11.44 that has the least width? 
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11.48 The standard deviation of the breaking strengths of certain cables produced by a company is given as 
240 pounds (lb). After a change was introduced in the process of manufacture of these cables, the breaking 
strengths of a sample of eight cables showed a standard deviation of 300 lb. Investigate the significance of the 
apparent increase in variability, using significance levels of (a) 0.05 and (b) 0.01. 

1 1.49 The standard deviation of the a nnual temperatures of a city over a period of 100 years was 16° Fahrenheit. 
Using the mean temperature on the 15th day of each month during the last 15 years, a standard deviation 
of annual temperatures was computed as 10° Fahrenheit. Test the hypothesis that the temperatures in the 
city have become less variable than in the past, using significance levels of (a) 0.05 and (b) 0.01. 

THE F DISTRIBUTION 

11.50 Find the values of F in parts a, b, c, and d using Appendix V and VI. 



(«) 


F .9 


5 with V\ 


= 8 and V 2 -- 


= 10. 


(b) 


F .g 


9 with V\ 


= 24 and V 2 


= 11. 


(c) 


Fo.s 


5 with N x 


= 16 and N 2 


= 25. 


(d) 


Fq.9 


o with Ni 


= 21 and N 2 


= 23. 



11.51 Solve Problem 1 1.50 using EXCEL. 

11.52 Two samples of sizes 10 and 15 are drawn from two normally distributed populations having variances 40 
and 60, respectively. If the sample variances are 90 and 50, determine whether the sample 1 variance is 
significantly greater than the sample 2 variance at significance levels of (a) 0.05 and (b) 0.01. 

11.53 Two companies, A and B, manufacture electric light bulbs. The lifetimes for the A and B bulbs are very 
nearly normally distributed, with standard deviations of 20 h and 27 h, respectively. If we select 16 bulbs 
from company A and 20 bulbs from company B and determine the standard deviations of their lifetimes to 
be 15 h and 40 h, respectively, can we conclude at significance levels of (a) 0.05 and (b) 0.01 that the 
variability of the A bulbs is significantly less than that of the B bulbs? 



The Chi-Square Test 



OBSERVED AND THEORETICAL FREQUENCIES 

As we have already seen many times, the results obtained in samples do not always agree exactly 
with the theoretical results expected according to the rules of probability. For example, although 
theoretical considerations lead us to expect 50 heads and 50 tails when we toss a fair coin 100 times, 
it is rare that these results are obtained exactly. 

Suppose that in a particular sample a set of possible events E u E 2 , E 3 , . . . , E k (see Table 12.1) are 
observed to occur with frequencies 0\, o 2 , o 3 , . . . , o k , called observed frequencies, and that according to 
probability rules they are expected to occur with frequencies e x , e 2 , e 3 , . . . , e k , called expected, or theo- 
retical, frequencies. Often we wish to know whether the observed frequencies differ significantly from the 
expected frequencies. 



Table 12.1 





Ei 


E 2 






E k 


Observed frequency 










Ok 


Expected frequency 













DEFINITION OF % 2 

A measure of the discrepancy existing between the observed and expected frequencies is supplied by 
the statistic \ (read chi-square) given by 

x 2 = i^L i (°2-e2) 2 , , {o k -e k )\ .£ (Qj-ej) 2 (;) 
where if the total frequency is N, 

E«; = E ej = N (2) 
An expression equivalent to formula (7) is (see Problem 12.11) 

X 2 = E°^~N (3) 

If ^ =0, the observed and theoretical frequencies agree exactly; while if x > 0, they do not agree 
exactly. The larger the value of x > the greater is the discrepancy between the observed and expected 
frequencies. 
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The sampling distribution of \ is approximated very closely by the chi-square distribution 

Y=Y ( x y^-%-^=Y oX ^e-^ 2 (4) 

(already considered in Chapter 11) if the expected frequencies are at least equal to 5. The approximation 
improves for larger values. 

The number of degrees of freedom, v, is given by 

(1) v = k - 1 if the expected frequencies can be computed without having to estimate the population 
parameters from sample statistics. Note that we subtract 1 from k because of constraint condition 
(2), which states that if we know k - 1 of the expected frequencies, the remaining frequency can be 
determined. 

(2) v — k - 1 - m if the expected frequencies can be computed only by estimating m population 
parameters from sample statistics. 

SIGNIFICANCE TESTS 

In practice, expected frequencies are computed on the basis of a hypothesis H . If under this 
hypothesis the computed value of \ 2 given by equation (/) or (3) is greater than some critical value 
(such as x%5 or x 2 99, which are the critical values of the 0.05 and 0.01 significance levels, respectively), we 
would conclude that the observed frequencies differ significantly from the expected frequencies and 
would reject H at the corresponding level of significance; otherwise, we would accept it (or at least 
not reject it). This procedure is called the chi-square test of hypothesis or significance. 

It should be noted that we must look with suspicion upon circumstances where x is too close to zero, 
since it is rare that observed frequencies agree too well with expected frequencies. To examine such 
situations, we can determine whether the computed value of x 2 is less than x 2 05 or x%i, in which cases we 
would decide that the agreement is too good at the 0.05 or 0.01 significance levels, respectively. 

THE CHI-SQUARE TEST FOR GOODNESS OF FIT 

The chi-square test can be used to determine how well theoretical distributions (such as the normal 
and binomial distributions) fit empirical distributions (i.e., those obtained from sample data). 
See Problems 12.12 and 12.13. 

EXAMPLE 1. A pair of dice is rolled 500 times with the sums in Table 12.2 showing on the dice: 



Table 12.2 



Sum 


2 


3 


4 


5 


6 


7 


8 


9 


10 


11 


12 


Observed 


15 


35 


49 


58 


65 


76 


72 


60 


35 


29 


6 


The expected number, if the dice are 




determined from the distribution of 
Table 12.3 


x as in Table 12.3. 






2 


3 


4 


5 


6 


7 


8 


9 


10 


11 


12 


p(x) 


1/36 


2/36 


3/36 


4/36 


5/36 


6/36 


5/36 


4/36 


3/36 


2/36 


1/36 


We have the observed and expected frequencie 


s in Table 12.4. 
Table 12.4 












Observed 


15 


35 


49 


58 


65 


76 


72 


60 


35 


29 


6 


Expected 


13.9 


27.8 


41.7 


55.6 


69.5 


83.4 


69.5 


55.6 


41.7 


27.8 


13.9 
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If the observed and expected are entered into B1:L2 in the EXCEL worksheet, the expression =(B1-B2) A 2/B2 
is entered into B4, a click-and-drag is executed from B4 to L4, and then the quantities in B4:L4 are summed we 
obtain 10.34 for x 2 = E, ((o, - ejf/ej). 

The /rvalue corresponding to 10.34 is given by the EXCEL expression =CHIDIST(10. 34,10). The ^-value is 
0.411. Because of this large />- value, we have no reason to doubt the fairness of the dice. 



CONTINGENCY TABLES 

Table 12.1, in which the observed frequencies occupy a single row, is called a one-way classification 
table. Since the number of columns is k, this is also called alxi (read "1 by k") table. By extending 
these ideas, we can arrive at two-way classification tables, or h x k tables, in which the observed 
frequencies occupy h rows and k columns. Such tables are often called contingency tables. 

Corresponding to each observed frequency in an It x k contingency table, there is an expected (or 
theoretical) frequency that is computed subject to some hypothesis according to rules of probability. 
These frequencies, which occupy the cells of a contingency table, are called cell frequencies. The total 
frequency in each row or each column is called the marginal frequency. 

To investigate agreement between the observed and expected frequencies, we compute the statistic 

, 2 = E^ (5) 

where the sum is taken over all cells in the contingency table and where the symbols o, and e, represent, 
respectively, the observed and expected frequencies in the y'th cell. This sum, which is analogous to 
equation (7), contains hk terms. The sum of all observed frequencies is denoted by N and is equal to 
the sum of all expected frequencies [compare with equation (2)]. 

As before, statistic (5) has a sampling distribution given very closely by (4), provided the expected 
frequencies are not too small. The number of degrees of freedom, v, of this chi-square distribution is 
given for h > 1 and k > 1 by 

1 . v = (h - 1) (k - 1) if the expected frequencies can be computed without having to estimate popula- 
tion parameters from sample statistics. For a proof of this, see Problem 12.18. 

2. v — (h - l)(k - 1) - m if the expected frequencies can be computed only by estimating m population 
parameters from sample statistics. 

Significance tests for h x k tables are similar to those for 1 x k tables. The expected frequencies are 
found subject to a particular hypothesis H . A hypothesis commonly assumed is that the two classifica- 
tions are independent of each other. 

Contingency tables can be extended to higher dimensions. Thus, for example, we can have hx k x I 
tables, where three classifications are present. 

EXAMPLE 2. The data in Table 12.5 were collected on how individuals prepared their taxes and their education 
level. The null hypothesis is that the way people prepare their taxes (computer software or pen and paper) is 
independent of their education level. Table 12.5 is a contingency table. 



Table 12.5 





Education Level 


Tax prep. 


High school 


Bachelors 


Masters 


computer 


23 


35 


42 


Pen and paper 


45 


30 


25 



If MINITAB is used to analyze this data, the following results are obtained. 
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Chi-Square Test: highschool, bachelors, masters 

Expected counts are printed below observed counts 
Chi-Square contributions are printed below expected c 



highschool bachelo: 



34.00 
3 .559 



32.50 
0.192 



33.50 
2 .157 



Total 
100 



34.00 
3 .559 



32.50 
0.192 



33 . 50 
2 .157 



Total 68 
Chi-Sq = 11.816, DF = 



2 , P-Value = .003 



Because of the small p-value, the hypothesis of independence would be rejected and we would 
conclude that tax preparation would be contingent upon education level. 



YATES' CORRECTION FOR CONTINUITY 

When results for continuous distributions are applied to discrete data, certain corrections for con- 
tinuity can be made, as we have seen in previous chapters. A similar correction is available when the 
chi-square distribution is used. The correction consists in rewriting equation (7) as 

(corrected) - - *' - °- 5 >' + <"* - *l - °- 5) ' + ■ ■ ■ + fez 'J ~ ■»>' (6) 
e\ e 2 e k 

and is often referred to as Yates'' correction. An analogous modification of equation (5) also exists. 

In general, the correction is made only when the number of degrees of freedom isu=\. For large 
samples, this yields practically the same results as the uncorrected \, but difficulties can arise 
near critical values (see Problem 12.8). For small samples where each expected frequency is between 
5 and 10, it is perhaps best to compare both the corrected and uncorrected values of \ ■ If both values 
lead to the same conclusion regarding a hypothesis, such as rejection at the 0.05 level, difficulties are 
rarely encountered. If they lead to different conclusions, one can resort to increasing the sample sizes or, 
if this proves impractical, one can employ methods of probability involving the multinomial distribution 
of Chapter 6. 



SIMPLE FORMULAS FOR COMPUTING Z 2 

Simple formulas for computing \ that involve only the observed frequencies can be derived. 
The following gives the results for 2 x 2 and 2x3 contingency tables (see Tables 12.6 and 12.7, 
respectively). 

2x2 Tables 

2= N(a l b 2 -a 2 b i ) 2 = NA 2 

X (a l +b l )(a 2 + b 2 )(a l +a 2 )(b l +b 2 ) N X N 2 N A N B U 
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where A = a l b 2 - a 2 b l , N = a x + a 2 + b x + b 2 , N l =a 1 +b 1 , N 2 = 
n b = °\ + b 2 (see Problem 12.19). With Yates' correction, this becomes 



X (corrected) = 



N{\ ai b 2 -aM-\Nf 



bi> N A = + a 2 , and 
7V(|A|-I^V) 2 



+ M(«2 + 6 2 )(fli + a 2 )(6i + b 2 ) N X N 2 N A N B 



2x3 Tables 



X ^ [jV; N 2 N 3 \ N b [A^i ' 7V 2 ' 7V 3 J 
where we have used the general result valid for all contingency tables (see Problem 12.43): 

Result (9) for 2 x k tables where k > 3 can be generalized (see Problem 12.46). 



(8) 



COEFFICIENT OF CONTINGENCY 

A measure of the degree of relationship, association, or dependence of the classifications in i 
contingency table is given by 



C = 



{11) 



which is called the coefficient of contingency. The larger the value of C, the greater is the degree of 
association. The number of rows and columns in the contingency table determines the maximum value of 
C, which is never greater than 1 . If the number of rows and columns of a contingency table is equal to k, 
the maximum value of C is given by \/(k— l)/k (see Problems 12.22, 12.52, and 12.53). 



Find the coefficient of c< 



lgency for Example 2. 



C= ! %1 I UM6 -02 
Vx 2 + N V 11.816 + 200 



CORRELATION OF ATTRIBUTES 

Because classifications in a contingency table often describe characteristics of individuals or objects, 
they are often referred to as attributes, and the degree of dependence, association, or relationship is 
called the correlation of attributes. For kx k tables, we define 



(12) 
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as the correlation coefficient between attributes (or classifications). This coefficient lies between and 1 
(see Problem 12.24). For 2x2 tables in which k = 2, the correlation is often called tetrachoric 
correlation. 

The general problem of correlation of numerical variables is considered in Chapter 14. 



ADDITIVE PROPERTY OF / 2 

Suppose that the results of repeated experiments yield sample values of \ 2 given by xi, X2> xl> ■ ■ ■ 
with u { , v 2 , v } , . . . degrees of freedom, respectively. Then the result of all these experiments can 

be considered equivalent to a \ ~ value given by xi + x\ + x\ H with v \ + v i + v t, H degrees of 

freedom (see Problem 12.25). 



Solved Problems 



THE CHI-SQUARE TEST 



12.1 In 200 tosses of a coin, 115 heads and 85 tails were observed. Test the hypothesis that the coin is 
fair, using Appendix IV and significance levels of (a) 0.05 and (to) 0.01. Test the hypothesis by 
computing the p-value and (c) comparing it to levels 0.05 and 0.01. 

SOLUTION 

The observed frequencies of heads and tails are o x = 115 and o 2 = 85, respectively, and the expected 
frequencies of heads and tails (if the coin is fair) are e x = 100 and <? 2 = 100, respectively. Thus 

v 2_ (oi~ei) 2 , ( O2 -g2) 2 _ (H5 - 100) 2 (85 - 100) 2 
X e x e 2 100 100 

Since the number of categories, or classes (heads, tails), is k = 2, v = k-\ = 2-\ = \. 

(a) The critical value xm for 1 degree of freedom is 3.84. Thus, since 4.50 > 3.84, we reject the hypothesis 
that the coin is fair at the 0.05 significance level. 

(b) The critical value x 2 99 for 1 degree of freedom is 6.63. Thus, since 4.50 < 6.63, we cannot reject the 
hypothesis that the coin is fair at the 0.02 significance level. 

We conclude that the observed results are probably significant and that the coin is probably not fair. 
For a comparison of this method with previous methods used, see Problem 12.3. 

Using EXCEL, the />-value is given by =CHIDIST(4.5,1), which equals 0.0339. And we see, using the 
^-value approach that the results are significant at 0.05 but not at 0.01. Either of these methods of testing 
may be used. 



12.2 Work Problem 12.1 by using Yates' correction. 
SOLUTION 

v 2 fcorrectecn - - g| 1 - ^ + (l ° 2 - ^ - ^ - (I 115 - 100| — 0-5) 2 (|85 - 100| - 0.5) 2 
X (corrected)- - + - - — + — 

= ^ + ^ = 4.205 
100 100 

Since 4.205 > 3.84 and 4.205 < 6.63, the conclusions reached in Problem 12.1 are valid. For a comparison 
with previous methods, see Problem 12.3. 
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12.3 Work Problem 12.1 by using the normal approximation to the binomial distribution. 
SOLUTION 

Under the hypothesis that the coin is fair, the mean and standard deviation of the number of heads 
expected in 200 tosses of a coin are y, = Np = (200)(0.5) = 100 and a = ^/Npq = v /(200)(0.5)(0.5) = 7.07, 
respectively. 

First method 

115 heads in standard units = 115 ? 07 100 = 2.12 

Using the 0.05 significance level and a two-tailed test, we would reject the hypothesis that the coin is fair 
if the z score were outside the interval —1.96 to 1.96. With the 0.01 level, the corresponding interval would be 
-2.58 to 2.58. It follows (as in Problem 12.1) that we can reject the hypothesis at the 0.05 level but cannot 
reject it at the 0.01 level. 

Note that the square of the above standard score, (2.12) 2 = 4.50, is the same as the value of \ 2 obtained 
in Problem 12.1. This is always the case for a chi-square test involving two categories (see Problem 12.10). 

Second method 

Using the correction for continuity, 1 15 or more heads is equivalent to 1 14.5 or more heads. Then 1 14.5 
in standard units = (114.5 - 100)/7.07 = 2.05. This leads to the same conclusions as in the first method. 

Note that the square of this standard score is (2.05) 2 = 4.20, agreeing with the value of \ 2 corrected for 
continuity by using Yates' correction of Problem 12.2. This is always the case for a chi-square test involving 
two categories in which Yates' correction is applied. 



12.4 Table 12.8 shows the observed and expected frequencies in tossing a die 120 times. 

(a) Test the hypothesis that the die is fair using a 0.05 significance level by calculating \ 2 and 
giving the 0.05 critical value and comparing the computed test statistic with the critical 
value. 

(b) Compute the /rvalue and compare it with 0.05 to test the hypothesis. 



Die face 


1 




3 


4 


5 


6 


Observed frequency 


25 


17 


15 


23 


24 


16 


Expected frequency 


20 


20 


20 


20 


20 


20 


-e,f , (o 2 -e 2 f , (o 3 - 




o 4 -e 4 ) 2 (o 




+ K- 




e, e 2 e 3 






e 5 


e 6 





_ (25 - 20) 2 (17- 20) 2 (15 - 20)" (23 - 20)" (24 - 20)- (16-20) 2 



20 



20 



20 



20 



20 



20 



(a) The 0.05 critical value is given by the EXCEL expression =CHIINV(0.05, 5) or 1 1 .0705. The computed 
value of the test statistic is 5.00. Since the computed test statistic is not in the 0.05 critical region, do not 
reject the null that the die is fair. 

(b) The /7-value is given by the EXCEL expression =CHIDIST(5.00, 5) or 0.4159. Since the Rvalue is not 
less than 0.05, do not reject the null that the die is fair. 
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12.5 Table 12.9 shows the distribution of the digits 0, 1, 2, . . . , 9 in a random-number table of 
250 digits, (a) Find the value of the test statistic x, (b) find the 0.01 critical value and give 
your conclusion for a = 0.01, and (c) find the /rvalue for the value you found in (a) and give 
your conclusion for a = 0.01. 



Table 12.9 



Digit 





1 






4 


5 


6 


7 


8 


9 


Observed frequency 


17 


31 


29 


18 


14 


20 


35 


30 


20 


36 


Expected frequency 


25 


25 


25 


25 


25 


25 


25 


25 


25 


25 



SOLUTION 

. 2 _ (17 - 25) 2 (31 - 25) 2 (29 - 25) 2 (18 - 25) 2 (36 - 25) 2 _ 

X " 25 + 25 + 25 + 25 25 
The 0.01 critical value is given by =CHIINV(0.01, 9) which equals 21.6660. Since the computed value of 
X 2 exceeds this value, we reject the hypothesis that the numbers are random. 

The /;-value is given by the EXCEL expression =CHIDIST(23.3, 9) which equals 0.0056, which is less 
than 0.01. By the p- value technique, we reject the null. 



12.6 In his experiments with peas, Gregor Mendel observed that 315 were round and yellow, 108 were 
round and green, 101 were wrinkled and yellow, and 32 were wrinkled and green. According to 
his theory of heredity, the numbers should be in the proportion 9 : 3 : 3 : 1 . Is there any evidence to 
doubt his theory at the (a) 0.01 and (b) 0.05 significance levels? 



The total number of peas is 315 + 108 + 101 + 32 = 556. Since the expected numbers are in the propor- 
tion 9:3:3:1 (and 9 + 3 + 3 + 1 = 16), we would expect 

^(556) = 3 12.75 round and yellow ^(556) = 104.25 wrinkled and yellow 

^ (556) = 104.25 round and green ± (556) = 34.75 wrinkled and green 

2 _ (315 - 312.75) 2 (108 - 104.25) 2 (101 - 104.25) 2 (32 - 34.75) 2 _ 
X ~ 31Z75 + 10425 + 10i25 + 34J5 " °- 47 ° 
Since there are four categories, k = 4 and the number of degrees of freedom is v = 4 - 1 = 3. 

(a) For v = 3, = 1 1-3, and thus we cannot reject the theory at the 0.01 level. 

(b) For v = 3, x 2 95 = 7.81, and thus we cannot reject the theory at the 0.05 level. 
We conclude that the theory and experiment are in agreement. 

Note that for 3 degrees of freedom, x%5 = 0-352 and x = 0-470 > 0.352. Thus, although the 
t is good, the results obtained are subject to a reasonable amount of sampling error. 



12.7 An urn contains a very large number of marbles of four different colors: red, orange, yellow, and 
green. A sample of 12 marbles drawn at random from the urn revealed 2 red, 5 orange, 4 yellow, 
and 1 green marble. Test the hypothesis that the urn contains equal proportions of the differently 
colored marbles. 

SOLUTION 

Under the hypothesis that the urn contains equal proportions of the differently colored marbles, 
we would expect 3 of each kind in a sample of 12 marbles. Since these expected numbers are less than 5, 
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the chi-square approximation will be in error. To avoid this, we combine categories so that the expected 
number in each category is at least 5. 

If we wish to reject the hypothesis, we should combine categories in such a way that the evidence against 
the hypothesis shows up best. This is achieved in our case by considering ihc categories "red or green" and 
"orange or yellow," for which the sample revealed 3 and 9 marbles, respectively. Since the expected number 
in each category under the hypothesis of equal proportions is 6, we have 

For v = 2 - 1 = 1, x 2 95 = 3.84. Thus we cannot reject the hypothesis at the 0.05 significance level 
(although we can at the 0.10 level). Conceivably the observed results could arise on the basis of chance 
even when equal proportions of the colors are present. 

Another method 

Using Yates' correction, we find 

2 (j3-6| - 0.5) 2 (|9 - 6| - 0.5) 2 (2.5) 2 (2.5) 2 
X 6 + 6 6 + 6 

which leads to the same conclusion given above. This is to be expected, of course, since Yates' correction 
always reduces the value of \ . 

It should be noted that if the \ approximation is used despite the fact that the frequencies are too 
small, we would obtain 

2 = (2^3f + (i^3f = 33 3 

X 3 + 3 + 3 + 3 

Since for v = 4 - 1 = 3, x; 9i = 7.81, we would arrive at the same conclusions as above. Unfortunately, 
the x 2 approximation for small frequencies is poor; hence, when it is not advisable to combine frequencies, 
we must resort to the exact probability methods of Chapter 6. 

12.8 In 360 tosses of a pair of dice, 74 sevens and 24 elevens are observed. Using the 0.05 significance 
level, test the hypothesis that the dice are fair. 

SOLUTION 

A pair of dice can fall 36 ways. A seven can occur in 6 ways, an eleven in 2 ways. Then 
Pr{seven} = ^ = i and Pr {eleven} = ^ = i Thus in 360 tosses we would expect ^(360) = 60 sevens and 
^ (360) = 20 elevens, so that 

2 _ (74-60f (24-20) 2 
x 60 20 

For v = 2 — 1 = 1, x 2 95 = 3.84. Thus, since 4.07 > 3.84, we would be inclined to reject the hypothesis 
that the dice are fair. Using Yates' correction, however, we find 

(comcM) _ tlW-^-M- + (P4-20I-0.5)' _ ^ + _ 3 M 

Thus on the basis of the corrected \ 2 we could not reject the hypothesis at the 0.05 level. 

In general, for large samples such as we have here, results using Yates' correction prove to be 
more reliable than uncorrected results. However, since even the corrected value of x lies so close to the 
critical value, we are hesitant about making decisions one way or the other. In such cases it is perhaps best 
to increase the sample size by taking more observations if we are interested especially in the 0.05 level 
for some reason; otherwise, we could reject the hypothesis at some other level (such as 0.10) if this is 
satisfactory. 



12.9 A survey of 320 families with 5 children revealed the distribution shown in Table 12.10. Is the 
result consistent with the hypothesis that male and female births are equally probable? 
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Table 12.10 



Number of 
boys and girls 


5 boys 
girls 


4 boys 


3 boys 
2 girls 




1 boy 
4 girls 


boys 


Total 


Number of families 






110 




40 







SOLUTION 

Let p = probability of a male birth, and let q = 1 - p = probability of a female birth. Then the prob- 
abilities of (5 boys), (4 boys and 1 girl), . . . , (5 girls) are given by the terms in the binomial expansion 

( p + qf = p 5 + 5p 4 q + I0p 3 q 2 + 10pV + W + ^ 

If p — q = \, we have 

Pr{5 boys and girls} = (i) 5 = i Pr{2 boys and 3 girls} = 10(i) 2 (i) 3 = ±§ 

Pr{4 boys and 1 girl} = 5(i) 4 © = £ Pr{l boy and 4 girls} = 5(i)0 4 = | 

Pr{3 boys and 2 girls} = 10(|) 3 (i) 2 = f 2 Pr{0 boys and 5 girls} = @ 5 = i 

Then the expected number of families with 5, 4, 3, 2, 1, and boys are obtained by multiplying the above 
probabilities by 320, and the results are 10, 50, 100, 100, 50, and 10, respectively. Hence 

2 _ (18 - 10) 2 (56- 50) 2 (110 - 100) 2 (88 - 100) 2 (40 - 50) 2 (8 - 10) 2 
X " 10 + 50 + 100 + 100 + 50 + 10 ~ 
Since \ K = 11.1 and \ m — 15.1 for v = 6 — 1 — 5 degrees of freedom, we can reject the hypothesis at 

the 0.05 but not at the 0.01 significance level. Thus we conclude that the results are probably significant and 

male and female births are not equally probable. 



12.10 In a survey of 500 individuals, it was found that 155 of the 500 rented at least one video from a 
video rental store during the past week. Test the hypothesis that 25% of the population rented at 
least one video during the past week using a two-tailed alternative and a = 0.05. Perform the test 
using both the standard normal distribution and the chi-square distribution. Show that the 
chi-square test involving only two categories is equivalent to the significance test for proportions 
given in Chapter 10. 

SOLUTION 

If the null hypothesis is true, then l i = Np = 500(0.25) = 125 and a = y/Npq = x /500(0.25)(0.75) = 9.68. 
The computed test statistic is Z = (155 - 125)/9.68 = 3.10. The critical values are ±1.96, and the null 
hypothesis is rejected. 

The solution using the chi-square distribution is found by using the results as displayed in Table 12.11. 



Table 12.11 



Frequency 


Rented Video 


Did Not Rent Video 


Total 


Observed 


155 


345 


500 


Expected 


125 


375 


500 



The computed chi-square statistic is determined as follows: 

2 _ (155 - 125) 2 (345 - 375) 2 
X 125 + 375 

The critical value for one degree of freedom is 3.84, and the null hypothesis is rejected. Note that 
(3.10) 2 = 9.6 and (±1.96) 2 = 3.84 or Z 2 = x . The two procedures are equivalent. 
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12.11 (a) Prove that formula (7) of this chapter can be written 

x 2 = E °f-N 

(b) Use the result of part (a) to verify the value of \ computed in Problem 12.6. 
SOLUTION 

(a) By definition, 

= E|-2E» i + E^ = E|-2A' + A' = E|-A' 

where formula (2) of this chapter has been used. 

(ft) x^V^A^^ + ^ + ^ + ^ 556 ^0.470 

{ > X ^ ej 312.75 104.25 104.25 34.75 



GOODNESS OF FIT 



12.12 A racquetball player plays 3 game sets for exercise several times over the years and keeps records 
of how he does in the three game sets. For 250 days, his records show that he wins games on 
25 days, 1 game on 75 days, 2 games on 125 days, and wins all 3 games on 25 days. Test that 
Z=the number of wins in a 3 game series is binomial distributed at a = 0.05. 

SOLUTION 

The mean number of wins in 3 game sets is (0 x 25 + 1 x 75 + 2 x 125 + 3 x 25)/250 = 1.6. If X is 
binomial, the mean is np = 3p which is set equal to the statistic 1.6 and solving for p, we find that p = 0.53. 
We wish to test that Xis binomial with n = 3 and /> = 0.53. If Xis, binomial with p = 0.53, the distribution of 
X and the expected number of wins is shown in the following EXCEL output. Note that the binomial 
probabilities, p(x) are found by entering =BINOMDlST(A2, 3, 0.53, 0) and performing a click-and-drag 
from B2 to B5. This gives the values shown under p(x). 

x p(x) expected wins observed wins 

0.103823 25.95575 25 

1 0.351231 87.80775 75 

2 0.396069 99.01725 125 

3 0.148877 37.21925 25 



The expected wins are found by multiplying the p(x) values by 250. 

2 ^ (25 - 30.0) 2 (7S- 87.8) 2 (125 - 99.0) 2 (25 - 37.2) 2 
X 30.0 87.8 99.0 37.2 ' ' 

Since the number of parameters used in estimating the expected frequencies is m = 1 (namely, the 
parameter p of the binomial distribution), v = k- \- m = A- \- \ = 2. The p-value is given by the 
EXCEL expression =CHIDIST(12.73, 2) = 0.0017 and the hypothesis that the variable X is binomial 
distributed is rejected. 



12.13 The number of hours per week that 200 college students spend on the Internet is grouped into the 
classes to 3, 4 to 7, 8 to 11, 12 to 15, 16 to 19, 20 to 23, and 24 to 27 with the observed 
frequencies 12, 25, 36, 45, 34, 31, and 17. The grouped mean and the grouped standard deviation 
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are found from the data. The null hypothesis is that the data are normally distributed. Using the 
mean and the standard deviation that are found from the grouped data and assuming a normal 
distribution, the expected frequencies are found, after rounding off, to be the following: 10, 30, 
40, 50, 36, 28, and 6.t 

(a) Find x 2 - 

(b) How many degrees of freedom does x 2 have? 

(c) Use EXCEL to find the 5% critical value and give your conclusion at 5%. 

(d) Use EXCEL to find the /rvalue for your result. 

SOLUTION 

(a) A portion of the EXCEL worksheet is shown in Fig. 12-1. =(A2-B2) A 2/B2 is entered into C2 and a 
click-and-drag is executed from C2 to C8. =SUM(C2:C8) is entered into C9. We see that x = 22.7325. 



Fig. 12-1 Portion of EXCEL worksheet for Problem 12.13. 



(b) Since the number of parameters used in estimating the expected frequencies is m = 2 (namely, the mean 
u and the standard deviation a of the normal distribution), v = k - 1 - m = 7 - 1 - 2 = 4. Note that 
no classes needed to be combined, since the expected frequencies all exceeded 5. 

(c) The 5% critical value is given by =CHIINV(0.05, 4) or 9.4877. Reject the null hypothesis that the data 
came from a normal distribution since 22.73 exceeds the critical value. 

(d) The p-value is given by =CHIDIST(22.7325, 4) or we have p-value = 0.000143. 



CONTINGENCY TABLES 

12.14 Work Problem 10.20 by using the chi-square test. Also work using MINITAB and compare the 
two solutions. 
SOLUTION 

The conditions of the problem are presented in Table 12.12(a). Under the null hypothesis H that the 
serum has no effect, we would expect 70 people in each of the groups to recover and 30 in each group not to 
recover, as shown in Table 12.12(6). Note that H is equivalent to the statement that recovery is independent 
of the use of the serum (i.e., the classifications are independent). 



Table 12.12(a) Frequencies Observed 





Recover 


Do Not 
Recover 


Total 


Group A (using serum) 


75 


25 


100 


Group B (not using serum) 


65 


35 


100 


Total 


140 


60 


200 
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Table 12.12(6) Frequencies Expected under H 





Recover 


Do Not 
Recover 


Total 


Group A (using serum) 


70 


30 


100 


Group B (not using serum) 


70 


30 


100 


Total 


140 


60 


200 



2 _ (75 - 70) 2 (65 - 70) 2 , (25 - 30) 2 , (35 - 30) 2 _ o ^ 
X ~ 70 + 70 + 30 + 30 

To determine the number of degrees of freedom, consider Table 12.13, which is the same as Table 12.12 
except that only the totals are shown. It is clear that we have the freedom of placing only one number in any 
of the four empty cells, since once this is done the numbers in the remaining cells are uniquely determined 
from the indicated totals. Thus there is 1 degree of freedom. 



Table 12.13 





Recover 


Do Not 
Recover 


Total 


Group A 






100 


Group B 






100 


Total 


140 


60 


200 



Another method 

By formula (see Problem 12.18), v={h- \)[k - 1) = (2 - 1)(2 - 1) = 1. Since xls = 3.84 for 1 degree 
of freedom and since \ 2 = 2.38 < 3.84, we conclude that the results are not significant at the 0.05 level. We 
are thus unable to reject H at this level, and we either conclude that the serum is not effective or withhold 
decision, pending further tests. 

Note that \ 2 = 2.38 is the square of the z score, z = 1.54, obtained in Problem 10.20. In general the 
chi-square test involving sample proportions in a 2 x 2 contingene_\ table is equivalent to a test of signifi- 
cance of differences in proportions using the normal approximation. 

Note also that a one-tailed test using \ 2 is equivalent to a two-tailed test using \ since, for example, 
X 2 > X%5 corresponds to \ > X.95 or X < —X.95- Since for 2 x 2 tables \ is the square of the z score, it 
follows that x is the same as z for this case. Thus a rejection of a hypothesis at the 0.05 level using x 2 is 
equivalent to a rejection in a two-tailed test at the 0.10 level using z. 



Chi-Square Test: Recover, Not-recover 

Expected counts are printed below observed counts 
Chi-Square contributions are printed below expected c 

Recover Not-recover Total 

1 75 25 100 

70.00 30.00 

0.357 0.833 



Total 140 60 200 

Chi-Sq=2 .381, DF = 1 , P-Value = . 123 



CHAP. 12] 



THE CHI-SQUARE TEST 



307 



12.15 Work Problem 12.14 by using Yates' correction. 
SOLUTION 

X ^ (corrected) = ^ = + d 65 Z M Z ^ + d 25 Z *L Z ^ + ^ M Z = L93 

Thus the conclusions reached in Problem 12.14 are valid. This could have been realized at once by noting 
that Yates' correction always decreases the value of x 2 - 



12.16 A cellular phone company conducts a survey to determine the ownership of cellular phones in 
different age groups. The results for 1000 households are shown in Table 12.14. Test the 
hypothesis that the proportions owning cellular phones are the same for the different age groups. 



Table 12.14 



Cellular phone 


18-24 


25-54 


55-64 


> 65 


Total 


Yes 


50 


80 


70 


50 


250 


No 


200 


170 


180 


200 


750 


Total 


250 


250 


250 


250 


1000 



SOLUTION 

Under the hypothesis H that the proportions owning cellular phones are the same for the different age 
groups, 250/1000 = 25% is an estimate of the percentage owning a cellular phone in each age group, and 
75% is an estimate of the percent not owning a cellular phone in each age group. The frequencies expected 
under H are shown in Table 12.15. 

The computed value of the chi-square statistic can be found as illustrated in Table 12.16. 

The degrees of freedom for the chi-square distribution is v = (h — \)(k - 1) = (2 - 1)(4 - 1) = 3. Since 
X 2 95 = 7.81, and 14.3 exceeds 7.81, we reject the null hypothesis and conclude that the percentages are not the 
same for the four age groups. 



Table 12.15 



Cellular phone 


18-24 


25-54 


55-64 


> 65 


Total 


Yes 


25% of 250 = 62.5 


25% of 250 = 62.5 


25% of 250 = 62.5 


25% of 250 = 62.5 


250 


No 


75% of 


75% of 


75% of 


75% of 


750 




250 = 187.5 


250 = 187.5 


250 = 187.5 


250 = 187.5 




Total 


250 


250 


250 


250 


1000 



2. 1 

2, 2 
2, 3 
2, 4 
Sum 



62.5 
62.5 
62.5 
62.5 
187.5 
187.5 
187.5 
187.5 
1000 



156.25 
306.25 

56.25 
156.25 
156.25 
306.25 

56.25 
156.25 
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12.17 Use MINITAB to solve Problem 12.16. 



The MINITAB solution to Problem 12.16 is shown below. The observed and the expected counts are 
shown along with the computation of the test statistic. Note that the null hypothesis would be rejected for 
any level of significance exceeding 0.002. 

Data Display 

Row 18-24 25-54 55-64 65 or more 



MTB > chisquare cl-c4 
Chi-SquareTest 

Expected counts are printed below observed counts 

18-24 25-54 55-64 65 or mo 

1 50 80 70 50 
62.50 62.50 62.50 62.50 

2 200 170 180 200 
187.50 187.50 187.50 187.50 

Total 250 250 250 250 

Chi-Sq= 2.500 + 4.900 + 0.900 + 2.500 

0.833 + 1.633 + 0.300 + 0.833 : 
DF = 3, P-Value = 0.002 



12.18 Show that for an h x k contingency table the number of degrees of freedom is (h — 1) x (k - 1), 
where h > 1 and k > 1. 

SOLUTION 

In a table with h rows and k columns, we can leave out a single number in each row and column, since 
such numbers can easily be restored from a knowledge of the totals of each column and row. It follows that 
we have the freedom of placing only (h - \){k - 1) numbers into the table, the others being then automa- 
tically determined uniquely. Thus the number of degrees of freedom is (h - l)(k- 1). Note that this result 
holds if the population parameters needed in obtaining the expected frequencies are known. 



12.19 (a) Prove that for the 2 x 2 contingency table shown in Table 12.17(a), 

2 _ N{a l b 2 -a 2 b l f 



(b) Illustrate the result 



NiN 2 N A N B 

i part (a) with reference to the data of Problem 12.14. 



Table 12.17(a) Results Observed 



Table 12.17(6) Results Expected 





I 


II 


Total 


A 


N X N A /N 


N 2 N A /N 


N A 


B 




N 2 N B /N 


N B 


Total 


N l 


N 2 


N 
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SOLUTION 

(a) As in Problem 12.14, the results expected under a null hypothesis are shown in Table 12.17(6). Then 

2 = (a x -N X N A /N) 2 (a 2 -N 2 N A /N) 2 (b x -N x N B /N) 2 (b 2 -N 2 N B /N) 2 
X N[ N A /N N 2 N A /N N X N B /N N 2 N B /N 

N X N A (a, + b x )(a x + a 2 ) a x b 2 -a 2 b x 
BUI 01 -^T =a> ~ a x +b x +a 2 + b 2 = N 



are also equal t( 



N (ci\b 2 -a 2 b{\ N (a x b 2 - a 2 b{\ 



N\N A \ N J N 2 N A \ N J 
N f a 1 b 2 ~ 1 a 2 b 1 \ 2 _N (a x b 2 -a 2 b x \ 



N X N B \ N J N 2 N B \ 

2 _N(a x b 2 -a 2 b x ) 2 



(b) In Problem 12.14, a x = 75, a 2 = 25, b x = 65, b 2 = 35, N x = 140, N 2 = 60, N A = 100, N B = 100, and 
N = 200; then, as obtained before, 

2 _ 200[(75)(35) - (25)(65)] 2 = 
X (140)(60)(100)(100) 
Using Yates' correction, the result is the same as in Problem 12.15: 

N(\a x b 2 -a 2 b x \-\N) 2 200[| (75) (35) - (25)(65)| - 100] 2 



X (corrected) = 



(140)(60)(100)(100) 



12.20 Nine hundred males and 900 females were asked whether they would prefer more federal pro- 
grams to assist with childcare. Forty percent of the females and 36 percent of the males responded 
yes. Test the null hypothesis of equal percentages versus the alternative hypothesis of unequal 
percentages at a = 0.05. Show that a chi-square test involving two sample proportions is equiva- 
lent to a significance test of differences using the normal approximation of Chapter 10. 

SOLUTION 

Under hypothesis H , 



Hp,-p 2 = and o Pl _ 



^^(^ + i) = ^ - 38)(ara) (4 + 95o)= 



where p is estimated by pooling the proportions in the two samples. That is 
360 + 324 „ „„ 



= 1 -0.38 = 0.62 



The normal approximation test statistic is as follows: 

^ P x - P 2 0.40 - 0.36 
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The MINITAB solution for the chi-square analysis is as follows. 
Chi-SquareTest 

Expected counts are printed below observed counts 

males females Total 

1 324 360 684 
342.00 342.00 

2 576 549 1116 
558.00 558.00 



Chi-Sq = 0.947 + 0.947 + 

0.581 + 0.581 = 3.056 
DF = 1, P-Value = 0.080 

The square of the normal test statistic is (1.7467) 2 = 3.056, the value for the chi-square statistic. The two 
tests are equivalent. The p-values are always the same for the two tests. 



COEFFICIENT OF CONTINGENCY 

12.21 Find the coefficient of contingency for the data in the contingency table of Problem 12.14. 
SOLUTION 

C=J * 2 = J = V0.0U76 = 0.1084 

Y X + N V 2.38 + 200 

12.22 Find the maximum value of C for the 2 x 2 table of Problem 12.14. 
SOLUTION 

The maximum value of C occurs when the two classifications are perfectly dependent or associated. In 
such case all those who take the serum will recover, and all those who do not take the serum will not recover. 
The contingency table then appears as in Table 12.18. 





Recover 


Do Not 
Recover 


Total 


Group A (using serum) 


100 





100 


Group B (not using serum) 





100 


100 


Total 


100 


100 


200 



Since the expected cell frequencies, assuming complete independence, are all equal to 50, 
2 _ (100- 50) 2 (0 - 50) 2 , (0- 50) 2 , (100- 50) 2 _ onA 



Thus the maximum value of C is \/x 2 /{x 2 + N) = ^200/(200 + 200) = 0.7071. 
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In general, for perfect dependence in a contingency table where the number of rows and columns are 
both equal to k, the only nonzero cell frequencies occur in the diagonal from upper left to lower right of the 
contingency table. For such cases, C max = \J(k — l)/k. (See Problems 12.52 and 12.53.) 



CORRELATION OF ATTRIBUTES 

12.23 For Table 12.12 of Problem 12.14, find the correlation coefficient (a) without and (b) with Yates' 
correction. 

SOLUTION 

(a) Since x = 2-38, N = 200, and k = 2, we have 

r=J * 2 =JjL 1091 
Ytf(*-1) V200 

indicating very little correlation between recovery and the use of the serum. 

(b) From Problem 12.15, r (corrected) = ^\ .93/200 = 0.0982. 

12.24 Prove that the correlation coefficient for contingency tables, as defined by equation (12) of this 
chapter, lies between and 1. 

SOLUTION 



By Problem 12.53, the maximum value of \/\ 2 /(\ 2 + <V) is sj(k- \)/k. Thus 

<^ kx 2 <(k-\)(x 2 + N) k X 2 <k X 2 -X 2 + kN~N 
X + N k 

X 2 <(^^ ^T)<' and r =^l<l 
Since % 2 > 0, r > 0. Thus < r < 1, as required. 



ADDITIVE PROPERTY OF Z 2 



12.25 To test a hypothesis H , an experiment is performed three times. The resulting values of \ are 
2.37, 2.86, and 3.54, each of which corresponds to 1 degree of freedom. Show that while H Q 
cannot be rejected at the 0.05 level on the basis of any individual experiment, it can be rejected 
when the three experiments are combined. 

SOLUTION 

The value of \ 2 obtained by combining the results of the three experiments is, according to the additive 
property, % 2 = 2.37 + 2.86 + 3.54 = 8.77 with 1 + 1 + 1 = 3 degrees of freedom. Since x 2 95 for 3 degrees of 
freedom is 7.81, we can reject H at the 0.05 significance level. But since x%5 = 3.84 for 1 degree of freedom, 
we cannot reject H on the basis of any one experiment. 

In combining experiments in which values of \ 2 corresponding to 1 degree of freedom are obtained, 
Yates' correction is omitted since it has a tendency to overcorrect. 
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12.26 In 60 tosses of a coin, 37 heads and 23 tails were observed. Using significance levels of (a) 0.05 and (b) 0.01, 
test the hypothesis that the coin is fair. 

12.27 Work Problem 12.26 by using Yates' correction. 

12.28 Over a long period of time the grades given by a group of instructors in a particular course have averaged 
12% A's,18% B's, 40% C's,18% D's, and 12% F's. A new instructor gives 22 A's, 34 B's, 66 C's, 16 D's, and 
12 F's during two semesters. Determine at the 0.05 significance level whether the new instructor is following 
the grade pattern set by the others. 

12.29 Three coins were tossed a total of 240 times, and each time the number of heads turning up was observed. 
The results are shown in Table 12.19, together with the results expected under the hypothesis that the coins 
are fair. Test this hypothesis at a significance level of 0.05. 



Table 12.19 





Heads 


1 Head 


2 Heads 


3 Heads 


Observed frequency 


24 


108 


95 


23 


Expected frequency 


30 






30 



12.30 The number of books borrowed from a public library during a particular week is given in Table 12.20. Test 
the hypothesis that the number of books borrowed does not depend on the day of the week, using 
significance levels of (a) 0.05 and (b) 0.01. 



Number of books 
borrowed 



12.31 An urn contains 6 red marbles and 3 white ones. Two marbles are selected at random from the urn, their 
colors are noted, and then the marbles are replaced in the urn. This process is performed a total of 120 times, 
and the results obtained are shown in Table 12.21. 

(a) Determine the expected frequencies. 

(b) Determine at a significance level of 0.05 whether the results obtained are consistent with those expected. 



Table 12.21 





Red 


1 Red 


2 Red 




2 White 


1 White 


White 


| Number of drawings 


6 


53 


61 



12.32 Two hundred bolts were selected at random from the production of each of four machines. The numbers of 
defective bolts found were 2, 9, 10, and 3. Determine whether there is a significant difference between the 
machines, using a significance level of 0.05. 



CHAP. 12] 



THE CHI-SQUARE TEST 



313 



GOODNESS OF FIT 

12.33 (a) Use the chi-square test to determine the goodness of fit of the data in Table 7.9 of Problem 7.75. (b) Is the 
fit "too good"? Use the 0.05 significance level. 

12.34 Use the chi-square test to determine the goodness of fit of the data in (a) Table 3.8 of Problem 3.59 and 
(b) Table 3.10 of Problem 3.61. Use a significance level of 0.05, and in each case determine whether the fit is 



12.35 Use the chi-square test to determine the goodness of fit of the data in (a) Table 7.9 of Problem 7.75 and 
(b) Table 7.12 of Problem 7.80. Is your result in part (a) consistent with that of Problem 12.33? 



CONTINGENCY TABLES 

12.36 Table 12.22 shows the result of an experiment to investigate the effect of vaccination of laboratory animals 
against a particular disease. Using the (a) 0.01 and (b) 0.05 significance levels, test the hypothesis that there is 
no difference between the vaccinated and unvaccinated groups (i.e., that vaccination and this disease are 
independent). 

12.37 Work Problem 12.36 using Yates' correction. 

12.38 Table 12.23 shows the numbers of students in each of two classes, A and B, who passed and failed an 
examination given to both groups. Using the (a) 0.05 and (b) 0.01 significance levels, test the hypothesis that 
there is no difference between the two classes. Work the problem with and without Yates' correction. 



Vaccinated 
Not vaccinated 



Got Did Not Get 

Disease Disease 



12.39 Of a group of patients who complained that they did not sleep well, some were given sleeping pills while 
others were given sugar pills (although they all thought they were getting sleeping pills). They were later 
asked whether the pills helped them or not. The results of their responses are shown in Table 12.24. 
Assuming that all patients told the truth, test the hypothesis that there is no difference between sleeping 
pills and sugar pills at a significance level of 0.05. 



Table 12.24 





Slept 
Well 


Did Not 
Sleep Well 


Took sleeping pills 


44 


10 


Took sugar pills 


81 


35 



12.40 On a particular proposal of national importance, Democrats and Republicans cast their votes as shown in 
Table 12.25. At significance levels of (a) 0.01 and (b) 0.05, test the hypothesis that there is no difference 
between the two parties insofar as this proposal is concerned. 
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Table 12.25 





In Favor 


Opposed 


Undecided 


Democrats 


85 


78 


37 


Republicans 


118 


61 


25 



12.41 Table 12.26 shows the relation between the performances of students in mathematics and physics. Test the 
hypothesis that performance in physics is independent of performance in mathematics, using the (a) 0.05 and 
(b) 0.01 significance levels. 

Table 12.26 





Mathematics 


High 
Grades 


Medium 
Grades 


Low 
Grades 


Physics 


High grades 


56 


71 


12 


Medium grades 


47 


163 


38 


Low grades 


14 


42 


85 



12.42 The results of a survey made to determine whether the age of a driver 21 years of age and older has any effect 
on the number of automobile accidents in which he is involved (including all minor accidents) are shown in 
Table 12.27. At significance levels of (a) 0.05 and (b) 0.01, test the hypothesis that the number of accidents is 
independent of the age of the driver. What possible sources of difficulty in sampling techniques, as well as 
other considerations, could affect your conclusions? 



Table 12.27 





Age of Driver 


21-30 


31^40 


41-50 


51-60 


61-70 


Number of 
accidents 





748 


821 


786 


720 


672 


1 


74 


60 


51 


66 


50 




31 




22 


16 


15 


>2 


9 


10 


6 


5 


7 



12.43 (a) Prove that \ = J] (o", 2 /e,) - N for all contingency tables, where N is the total frequency of all cells. 
(b) Using the result in part (a), work Problem 12.41. 

12.44 If Ni and TV, denote, respectively, the sum of the frequencies in the rth row and /th columns of a contingency 
table (the marginal frequencies), show that the expected frequency for the cell belonging to the ith row and jth 
column is NiNj/N, where N is the total frequency of all cells. 

12.45 Prove formula (9) of this chapter. (Hint: Use Problems 12.43 and 12.44.) 



12.46 Extend the result of formula (9) of this chapter to 2 x k contingency tables, where k > 3. 
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12.47 Prove formula (8) of this chapter. 

12.48 By analogy with the ideas developed for h x k contingency tables, discuss // x k x / contingency tables, 
pointing out their possible applications. 

COEFFICIENT OF CONTINGENCY 

12.49 Table 12.28 shows the relationship between hair and eye color of a sample of 200 students. 

(a) Find the coefficient of contingency without and with Yates' correction. 

(b) Compare the result of part (a) with the maximum coefficient of contingency. 



Table 12.28 





Hair Color 


Blonde 


Not blonde 


Eye color 


Blue 


49 


25 


Not blue 


30 


96 



12.50 Find the coefficient of contingency for the data of (a) Problem 12.36 and (b) Problem 12.38 without and with 
Yates' correction. 

12.51 Find the coefficient of contingency for the data of Problem 12.41. 

12.52 Prove that the maximum coefficient of contingency for a 3 x 3 table is = 0.8165 approximately. 

12.53 Prove that the maximum coefficient of contingency for a k x k table is \/(k — \)/k. 

CORRELATION OF ATTRIBUTES 

12.54 Find the correlation coefficient for the data in Table 12.28. 

12.55 Find the correlation coefficient for the data in (a) Table 12.22 and (b) Table 12.23 without and with Yates' 
correction. 

12.56 Find the correlation coefficient between the mathematics and physics grades in Table 12.26. 

12.57 If C is the coefficient of contingency for a k x k table and r is the corresponding coefficient of correlation, 
prove that r = C/V(l - C 2 )(k - 1). 

ADDITIVE PROPERTY OF Z 2 

12.58 To test a hypothesis H , an experiment is performed five times. The resulting values of % > e ach correspond- 
ing to 4 degrees of freedom, are 8.3, 9.1, 8.9, 7.8, and 8.6, respectively. Show that while H cannot be rejected 
at the 0.05 level on the basis of each experiment separately, it can be rejected at the 0.005 level on the basis of 
the combined experiments. 



Curve Fitting and the 
I Method of Least 
I Squares 



RELATIONSHIP BETWEEN VARIABLES 

Very often in practice a relationship is found to exist between two (or more) variables. For 
example, weights of adult males depend to some degree on their heights, the circumferences of circles 
depend on their radii, and the pressure of a given mass of gas depends on its temperature and 
volume. 

It is frequently desirable to express this relationship in mathematical form by determining an 
equation that connects the variables. 



CURVE FITTING 

To determine an equation that connects variables, a first step is to collect data that show corre- 
sponding values of the variables under consideration. For example, suppose that X and Y denote, 
respectively, the height and weight of adult males; then a sample of N individuals would reveal the 
heights Xi, X 2 , . . . , X N and the corresponding weights Y ls Y 2 , Y N . 

A next step is to plot the points {X X ,Y X ), (X 2 ,Y 2 ), . . . , (X N ,Y N ) on a rectangular coordinate system. 
The resulting set of points is sometimes called a scatter diagram. 

From the scatter diagram it is often possible to visualize a smooth curve that approximates the data. 
Such a curve is called an approximating curve. In Fig. 13-1, for example, the data appear to be approxi- 
mated well by a straight line, and so we say that a linear relationship exists between the variables. In 
Fig. 13-2, however, although a relationship exists between the variables, it is not a linear relationship, 
and so we call it a nonlinear relationship. 

The general problem of finding equations of approximating curves that fit given sets of data is called 
curve fitting. 
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Scatterplot of weight vs height 
210- • 
200 - 




150 
140 
130 



64 65 66 67 68 69 70 71 72 73 
Height 

Fig. 13-1 Straight lines sometimes describe the relationship between two variables. 



Scatterplot of number present vs time 




2 4 6 8 10 

Fig. 13-2 Nonlinear relationships sometimes describe the relationship between two variables. 

EQUATIONS OF APPROXIMATING CURVES 

Several common types of approximating curves and their equations are listed below for reference 
purposes. All letters other than Zand 7 represent constants. The variables X and Y are often referred to 
as independent and dependent variables, respectively, although these roles can be interchanged. 



Straight line 


Y 


— a + a\X 


U) 


Parabola, or quadratic curve 


Y 


— a + a x X + a 2 X 2 


(2) 


Cubic curve 


Y 


= a + aiX+ a 2 X 2 + a 3 X 3 


(3) 


Quartic curve 


Y 


= a + a x X + a 2 X 2 + a 3 X 3 + a 4 X 4 


{4) 


nth-Degree curve 


Y 


= a + a x X + a 2 X 2 H h a„X" 


{5) 
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The right sides of the above equations are called polynomials of the first, second, third, fourth, and «th 
degrees, respectively. The functions defined by the first four equations are sometimes called linear, 



quadratic, cubic, and quartic functions, respectively. 

The following are some of the many other equations frequently used in practice: 

Hyperbola Y = -*— - or ^ = a o + a i x ( 6 ) 

a () + a\\ i 

Exponential curve Y = ab x or log Y = log a + (\ogb)X = a Q + a x X (7) 

Geometric curve Y = aX b or log Y — log a + b (log X) (8) 

Modified exponential curve Y = ab x + g (9) 

Modified geometric curve Y = aX + g (10) 

Gompertz curve Y = pq h or log Y = \ogp + b x (logq) = ab x + g (11) 

Modified Gompertz curve Y = pq h + h (12) 

Logistic curve Y = ^ ^ or y = ab x + g (13) 

Y=a Q +a l (\ogX) + a 2 (logX) 2 (14) 



To decide which curve should be used, it is helpful to obtain scatter diagrams of transformed 
variables. For example, if a scatter diagram of log Y versus X shows a linear relationship, the equation 
has the form (7), while if log Y versus log X shows a linear relationship, the equation has the form (8). 
Special graph paper is often used in order to make it easy to decide which curve to use. Graph paper 
having one scale calibrated logarithmically is called semilogarithmic (or semilog) graph paper, and that 
having both scales calibrated logarithmically is called log-log graph paper. 



FREEHAND METHOD OF CURVE FITTING 

Individual judgment can often be used to draw an approximating curve to fit a set of data. This is 
called & freehand me I hod of curve fitting. If the type of equation of this curve is known, it is possible to 
obtain the constants in the equation by choosing as many points on the curve as there are constants in 
the equation. For example, if the curve is a straight line, two points are necessary; if it is a parabola, three 
points are necessary. The method has the disadvantage that different observers will obtain different 
curves and equations. 



THE STRAIGHT LINE 

The simplest type of approximating curve is a straight line, whose equation can be written 

Y = a + a x X (15) 

Given any two points (X U Y { ) and (X 2 ,Y 2 ) on the line, the constants a and a x can be determined. The 
resulting equation of the line can be written 



(16) 
Y 2~Yi 

where m = — 

A 2 - A] 

is called the slope of the line and represents the change in Y divided by the corresponding change in X. 
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When the equation is written in the form (75), the constant a x is the slope m. The constant a , which 
is the value of Y when X = 0, is called the Y intercept. 



THE METHOD OF LEAST SQUARES 

To avoid individual judgment in constructing lines, parabolas, or other approximating curves to fit 
sets of data, it is necessary to agree on a definition of a "best-fitting line," "best-fitting parabola," etc. 

By way of forming a definition, consider Fig. 13-3, in which the data points are given by (X 1 ,Yi), 
(X 2 ,Y 2 ), . . . , (X N ,Y N ). For a given value of X, say X x , there will be a difference between the value Y x and 
the corresponding value as determined from the curve C. As shown in the figure, we denote this 
difference by D x , which is sometimes referred to as a deviation, error, or residual and may be positive, 
negative, or zero. Similarly, corresponding to the values X 2 ,...,X N we obtain the deviations 
D 2 ,...,D N . 

A measure of the "goodness of fit" of the curve C to the given data is provided by the quantity 

D\ + Z>2 + 1- D 2 N . If this is small, the fit is good; if it is large, the fit is bad. We therefore make the 

following 

Definition: Of all curves approximating a given set of data points, the curve having the 
property that D] + D 2 2 -\ h D 2 N is a minimum is called a best-fitting curve. 

A curve having this property is said to fit the data in the least-squares sense and is called a least-squares 
curve. Thus a line having this property is called a least-squares line, a parabola with this property is called 
a least-squares parabola, etc. 

It is customary to employ the above definition when X is the independent variable and Y is the 
dependent variable. If X is the dependent variable, the definition is modified by considering horizontal 
instead of vertical deviations, which amounts to an interchange of the X and Y axes. These two defini- 
tions generally lead to different least-squares curves. Unless otherwise specified, we shall consider Y the 
dependent variable and X the independent variable. 

It is possible to define another least-squares curve by considering perpendicular distances from each 
of the data points to the curve instead of either vertical or horizontal distances. However, this is not used 
very often. 




THE LEAST-SQUARES LINE 

The least-squares line approximating the set of points {X x ,Y X ), (X 2 , Y 2 ), 

Y=a + ai X 



, (X N ,Y N ) has the equation 
(77) 
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where the constants a a and a x are determined by solving simultaneously the equations 
£ Y = a N + a x £ X 

£jrr = fl £* + ai E^ 2 (18) 

which are called the normal equations for the least-squares line (77). The constants a and a\ of equations 
(18) can, if desired, be found from the formulas 

a = (E y)(Ex 2 )-(E jO(E *r) = jVE£I^(E £XE I) (19) 
NJ2^ 2 -(J2^) 2 ' 7VE^ 2 -(E^) 2 

The normal equations (18) are easily remembered by observing that the first equation can be 
obtained formally by summing on both sides of (17) [i.e., £ Y = E ( a o + a \ x ) = a N + a { £ X], 
while the second equation is obtained formally by first multiplying both sides of (17) by X and then 
summing [i.e., £ XY — X(a + a Y X) — a Q X + a x £ X 2 }. Note that this is not a derivation of the 
normal equations, but simply a means for remembering them. Note also that in equations (18) and (19) 
we have used the short notation E ^> E XY, etc -' 01 pl ace of E/=i Xj, E/=i XjYj, etc. 

The labor involved in finding a least-squares line can sometimes be shortened by transforming the 
data so that x = X — X and y = Y — Y . The equation of the least-squares line can then be written 
(see Problem 13.15) 

,= (§?>_ o,. ,= (|f). 
In particular, if X is such that J2 X — (i- e -, X = 0), this becomes 

Y=Y+(^0^X (21) 

Equation (20) implies that y = when x = 0; thus the least-squares line passes through the point 
(X,Y), called the centroid, or center of gravity, of the data. 

If the variable X is taken to be the dependent instead of the independent variable, we write equation 
(17) as X — b Q + b y Y. Then the above results hold if Zand Fare interchanged and a and a x are replaced 
by b and b { , respectively. The resulting least-squares line, however, is generally not the same as that 
obtained above [see Problems 13.11 and 1 3.1 5(d)] . 

NONLINEAR RELATIONSHIPS 

Nonlinear relationships can sometimes be reduced to linear relationships by an appropriate trans- 
formation of the variables (see Problem 13.21). 

THE LEAST-SQUARES PARABOLA 

The least-squares parabola approximating the set of points (X U Y { ), (X 2 ,Y 2 ), (X N ,Y N ) has the 
equation 

Y = a Q + a x X + a 2 X 2 (22) 
where the constants a , a\, and a 2 are determined by solving simultaneously the equations 
E Y ="oN +a { J2X +a 2 J2X 2 

E XY =«oE^ +0i E^ + ^E^ 3 

£A- 2 r = fl oE^ 2 + «iE^ 3 + «2E^ 4 (23) 

called the normal equal ions for the least-squares parabola (22). 
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Equations (23) are easily remembered by observing that they can be obtained formally by multi- 
plying equation (22) by 1, X, and X 2 , respectively, and summing on both sides of the resulting equations. 
This technique can be extended to obtain normal equations for least-squares cubic curves, least-squares 
quartic curves, and in general any of the least-squares curves corresponding to equation (5). 

As in the case of the least-squares line, simplifications of equations (23) occur if X is chosen so that 
£ X = 0. Simplification also occurs by choosing the new variables x = X - X and y=Y-Y. 



REGRESSION 

Often, on the basis of sample data, we wish to estimate the value of a variable Y corresponding to a 
given value of a variable X. This can be accomplished by estimating the value of Y from a least-squares 
curve that fits the sample data. The resulting curve is called a regression curve of Y on X, since Y is 
estimated from X. 

If we wanted to estimate the value of X from a given value of Y, we would use a regression curve of X 
on Y, which amounts to interchanging the variables in the scatter diagram so that X is the dependent 
variable and Y is the independent variable. This is equivalent to replacing the vertical deviations in the 
definition of the least-squares curve on page 284 with horizontal deviations. 

In general, the regression line or curve of Y on X is not the same as the regression line or curve 
of X on Y. 



APPLICATIONS TO TIME SERIES 

If the independent variable Xis time, the data show the values of Y at various times. Data arranged 
according to time are called time series. The regression line or curve of Y on X in this case is often called a 
trend line or trend curve and is often used for purposes of estimation, prediction, or forecasting. 



PROBLEMS INVOLVING MORE THAN TWO VARIABLES 

Problems involving more than two variables can be treated in a manner analogous to that for two 
variables. For example, there may be a relationship between the three variables X, Y, and Z that can be 
described by the equation 



which is called a linear equation in the variables X, Y, and Z. 

In a three-dimensional rectangular coordinate system this equation represents a plane, and the actual 
sample points (X l ,Y l ,Z l ), (X 2 , Y 2 , Z 2 ), . . . , (X N , Y N ,Z N ) may "scatter" not too far from this plane, 
which we call an approximating plane. 

By extension of the method of least squares, we can speak of a least-squares plane approximating the 
data. If we are estimating Z from given values of X and Y, this would be called a regression plane ofZ on 
X and Y. The normal equations corresponding to the least-squares plane (24) are given by 



Z = a Q + a l X + a 2 Y 



(24) 



£Z =a () N +a, J2X +a 2 J2Y 
J2XZ^a J2X + ai £ X 2 +a 2 J2XY 
E YZ=a J2 Y+a, E^ + «2E Y 2 



(25) 



and can be remembered as being obtained from equation (24) by multiplying by 1 , X, and Y successively 
and then summing. 
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More complicated equations than (24) can also be considered. These represent regression surfaces. If 
the number of variables exceeds three, geometric intuition is lost since we then require four-, five-, . . . 
^-dimensional spaces. 

Problems involving the estimation of a variable from two or more variables are called problems of 
multiple regression and will be considered in more detail in Chapter 15. 



Solved Problems 



STRAIGHT LINES 



Thirty high school students were surveyed in a study involving time spent on the Internet and 
their grade point average (GPA). The results are shown in Table 13.1. X is the amount of time 
spent on the Internet weekly and Y is the GPA of the student. 



Hours GPA Hours GPA Hours GPA 



2.36 
2.60 
2.42 



2.35 
3.14 
3.05 
2.06 
2.00 
2.78 
2.90 



3.14 
2.96 
2.30 
2.66 
2.36 
2.24 
3.08 
2.84 
2.45 



Use MINITAB to do the following: 

(a) Make a scatter plot of the data. 

(b) Fit a straight line to the data and give the values of a and a x . 
SOLUTION 

(a) The data are entered into columns CI and C2 of the MINITAB worksheet. CI is named Internet- 
hours and C2 is named GPA. The pull-down menu Stat — » Regression — » Regression is given and the 
results shown in Fig. 13-4 are formed. 

(b) The value for a a is 3.49 and the value for a x is -0.0594. 



13.2 Solve Problem 13.1 using EXCEL. 
SOLUTION 

The data is entered into columns A and B of the EXCEL worksheet. The pull-down menu 
Tools — » Data Analysis — » Regression produces the dialog box in Fig. 13-5 which is filled in as shown. 
The part of the output which is currently of interest is 

Intercept 3.488753 

Internet-hours -0.05935 
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3.50 
3.25 
3.00 
< 2.75 
2.50 
2.25 
2.00 



5 10 15 20 25 

Internet-hours 

Fig. 13-4 The sum of the squares of the distances of the points from the best-fitting line is a minimum for the line 
GPA = 3.49 - 0.0594 Internet-hours. 
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Output options 










O Output Range; 






New Worksheet Ply: 


1 




O Mew Workbook 






Residuals 






1 I Residuals 


□ Residual Plots 




1 I Standardized Residuals 


□ Line Fit Plots 




Normal Probability 

j i Normal Probability Plots 















Fig. 13-5 EXCEL dialog box for Problem 13.2. 



The constant a is called the intercept and a x is the slope. The same values as given by MINITAB are 
obtained. 

13.3 (a) Show that the equation of a straight line that passes through the points (X l , Y { ) and (X 2 , Y 2 ) 
is given by 



Scatterplot of GPA vs Internet-hours 
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(b) Find the equation of a straight line that passes through the points (2, -3) and (4, 5). 
SOLUTION 

(a) The equation of a straight line is 

Y = a + a x X (29) 
Y x =a + a x X x (30) 
Y 2 = a + a x X 2 (31) 
Y- Y x =a l (X-X l ) (32) 



Since (X X ,Y X ) lies on the line, 
Since (X 2 ,Y 2 ) lies on the line, 
Subtracting equation (30) from (29), 
Subtracting equation (30) from (31), 



Y 2 -Y l =a l (X 2 -X l ) or a x = ^— 
Substituting this value of a x into equation (32), we obtain 



as required. The quantity 

Y 2 -Y x 
X 2 -X x 

often abbreviated m, represents the change in Y divided by the corresponding change in X and is the 
slope of the line. The required equation can be written Y - Y x = m(X - X x ). 
(b) First method [using the result of part (a)] 

Corresponding to the first point (2,-3), we have X x = 2 and Y x = -3; corresponding to the 
second point (4, 5), we have X 2 = 4 and Y 2 = 5. Thus the slope is 

F 2 -F 1 _ 5-(-3) _8_ 
X 2 -X x 4-2 2 

and the required equation is 

Y — Y x = m(X — X x ) or F — (—3) = 4(X — 2) 
which can be written F + 3 = 4(X — 2), or F = 4X — 11. 
Second method 

The equation of a straight line is Y — a + a x X. Since the point (2,-3) is on the line 
-3 = a + 2a x , and since the point (4,5) is on the line, 5 = a + 4a x ; solving these two equations 
simultaneously, we obtain a x = 4 and a = — 1 1 . Thus the required equation is 



13.4 Wheat is grown on 9 equal-sized plots. The amount of fertilizer put on each plot is given in 
Table 13.2 along with the yield of wheat. 

Use MINITAB to fit the parabolic curve Y=a + a x X+a 2 X 2 to the data. 
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Table 13.2 



Amount of Wheat (y) 


Fertilizer (x) 


2.4 


1.2 


3.4 


2.3 


4.4 


3.3 


5.1 


4.1 


5.5 


4.8 


5.2 


5.0 


4.9 


5.5 


4.4 


6.1 


3.9 


6.9 



SOLUTION 

The wheat yield is entered into CI and the fertilizer is entered into C2. The pull-down 
Stat -> Regression -> Fitted Line Plot gives the dialog box in Fig. 13-6. 

Response |Y): |@ 
Predictor (X): U 

Type of Regression Model 
r Linear P Quadratic C £ubic 



Graphs.. . | Options... | Storage... | 

Help OK Cancel | 



Fig. 13-6 MINITAB dialog box for Problem 13.4. 

This dialog box gives the output in Fig. 13-7. 

Fitted Line Plot 
Y= -0.2009 + 2.266 X 
-0.2421 X**2 

5.5 
5.0 
4.5 
4.0 
3.5 
3.0 
2.5 
2.0 

1 2 3 4 5 6 7 

X 

Fig. 13-7 Fitting the least-squares parabolic curve to a set of data using MINITAB. 



CI Y 
C2 X 
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13.5 Find (a) the slope, (b) the equation, (c) the Y intercept, and (d) the X intercept of the line that 
passes through the points (1,5) and (4,-1). 



SOLUTION 

(a) (JT, = 1, Y l = 5) and (X 2 = 4, Y 2 -- 



■1). Thus 
, Y 2 -Yi 

m=slope= x 2 ^x l = 

The negative sign of the slope indicates that as X 



(b) The equation of the line is 



Y-Yi =m(X-X l ) 
Y-5 = -IX + 2 



Y decreases, as shown in Fig. 13-8 



Y-5= -2(X- 1) 
Y =1 -2X 



This can also be obtained by the second method of Problem 13. 3(b). 
(c) The Y intercept, which is the value of Y when X = 0, is given by Y = 7 - 2(0) = 7. This can also be 
seen directly from Fig. 13-8. 




ig. 13-8 Straight line showing X intercept and Y intercept. 



(d) The X intercept is the value of X when Y = 0. Substituting Y = in the equation Y = 1 - 2X, we have 
= 7 - 2X, or 2X = 7 and X = 3.5. This can also be seen directly from Fig. 13-8. 



13.6 Find the equation of a line passing through the point (4, 2) that is parallel to the line 2X + 37 = 6. 
SOLUTION 

If two lines are parallel, their slopes are equal. From 2X + 3 Y = 6 we have 3 Y = 6 - 2X, or 
Y = 2 — ^X, so that the slope of the line is m = — §. Thus the equation of the required line is 

Y — Y[ = m(X — X\) or F-2 = -f(X-4) 

which can also be written 2X + 3 Y = 14. 
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Another method 

Any line parallel to 2X + 3 Y = 6 has the equation 2X + 3 Y = c . To find c, let X = 4 and Y = 2. Then 
2(4) + 3(2) = c, or c = 14, and the required equation is 2X + 3Y = 14. 



13.7 Find the equation of a line whose slope is -4 and whose Y intercept is 16. 
SOLUTION 

In the equation Y = a + a l X, a = 16 is the Y intercept and a x = —4 is the slope. Thus the required 
equation is Y = 16 - 4X. 



13.8 (a) Construct a straight line that approximates the data of Table 13.3. 
(b) Find an equation for this line. 

Table 13.3 

X 1 3 4 6 8 9 11 14 
yi24457 8 9 

SOLUTION 

(a) Plot the points (1, 1), (3,2), (4,4), (6,4), (8,5), (9,7), (11,8), and (14,9) on a rectangular coordinate 
system, as shown in Fig. 13-9. A straight line approximating the data is draw n Irvclunul in the figure. 
For a method eliminating the need for individual judgment, see Problem 13.11, which uses the method 
of least squares. 




2 4 6 8 10 12 14 

X 

Fig. 13-9 Freehand method of curve fitting. 



(b) To obtain the equation of the line constructed in part (a), choose any two points on the line, such as P 
and Q; the coordinates of points P and Q, as read from the graph, are approximately (0, 1) and (12, 7.5). 
The equation of the line is Y = a Q + a x X . Thus for point (0, 1) we have 1 = a + fli(0), and for point 
(12, 7.5) we have 7.5 = a + \2a x ; since the first of these equations gives us a = 1, the second gives us 
a, = 6.5/12 = 0.542. Thus the required equation is Y = 1 + 0.542Z. 

Another method 



(X- 0) = 0.542Z 
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13.9 (a) Compare the values of Y obtained from the approximating line with those given in 
Table 13.2. 
(b) Estimate the value of Y when X = 10. 



SOLUTION 

(a) For X = 1, Y = 1 + 0.542(1) = 1.542, or 1.5. For X = 3, Y = 1 + 0.542(3) = 2.626 or 2.6. The values 
of Y corresponding to other values of X can be obtained similarly. The values of Y estimated from the 
equation Y = 1 + 0.542X are denoted by Y cst . These estimated values, together with the actual data 
from Table 13.3, are shown in Table 13.4. 

(b) The estimated value of Y when X = 10 is Y = 1 + 0.542(10) = 6.42, or 6.4. 



13.10 Table 13.5 shows the heights to the nearest inch (in) and the weights to the nearest pound (lb) of a 
sample of 12 male students drawn at random from the first-year students at State College. 



Table 13.5 



Height X (in) 


70 63 72 60 66 70 74 65 62 67 65 68 


Weight Y (lb) 


155 150 180 135 156 168 178 160 132 145 139 152 



(a) Obtain a scatter diagram of the data. 

(b) Construct a line that approximates the data. 

(c) Find the equation of the line constructed in part (b). 

(d) Estimate the weight of a student whose height is known to be 63 in. 

(e) Estimate the height of a student whose weight is known to be 1681b. 



{,-,') The scatter diagram, shown in Fig. 13-10, is obtained by plotting the points (70, 155), (63, 150), 
(68, 152). 

(b) A straight line that approximates the data is shown dashed in Fig. 13-10. This is but one of the many 
possible lines that could have been constructed. 

(c) Choose any two points on the line constructed in part (b), such as P and Q, for example. The coordi- 
nates of these points as read from the graph are approximately (60, 130) and (72, 170). Thus 

Y-Y^^iX-X^ r-130 = !^(*-60) Y^P-TO 

(d) If X = 63, then Y = f{63) - 70 = 140 1b. 

(e) If Y = 168, then 168 = fX- 70, fx = 238, and X = 71.4, or 71 in. 
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60 62 64 66 68 70 72 74 
Height 

Fig. 13-10 Freehand method of curve fitting. 



THE LEAST-SQUARES LINE 

13.11 Fit a least-squares line to the data of Problem 13.8 by using (a) Xas the independent variable and 
(b) X as the dependent variable. 

SOLUTION 

(a) The equation of the line is Y = a + a x X. The normal equations are 

£ Y =a N ifl.El 

J2XY = a J2X + a t £ X 2 

The work involved in computing the sums can be arranged as in Table 13.6. Although the 
right-hand column is not needed for this part of the problem, it has been added to the table for use in 
part (b). 

Since there are eight pairs of values of X and Y, N = 8 and the normal equations become 
8a + 56a, = 40 
56a + 524a, = 364 

Solving simultaneously, a Q = f^, or 0.545; a, = ^, or 0.636; and the required least-squares line is 
Y = £ + IX, or Y = 0.545 + 0.636Z. 



Table 13.6 



X 


Y 


X 2 


XY 


Y 2 


1 

3 


1 

2 


1 

9 


1 

6 


1 

4 


4 


4 


16 


16 


16 


6 


4 


36 


24 


16 


8 


5 


64 


40 


25 


9 


7 


81 


63 


49 


11 


8 


121 


88 


64 


14 


9 


196 


126 


81 


£X = 56 


£ F = 40 


Y,X 2 = 524 


J2 XY = 364 


£ Y 2 = 256 
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Another method 

a ^ (E y)(S X 2 ) - (E *)(S XY) = (40)(524) - (56)(364) = 6 m q ^ 

A f E^ 2 -(E^) 2 (8)(524) - (56) 2 H 

a = £ - (£ 7) = (8) (364) - (56) (40) = 7 ^ ^ 

1 ^E^ 2 -(E^) 2 (8)(524) - (56) 2 11 

Thus Y = a + aiX, or Y = 0.545 + 0.636X, as before. 
(b) If X is considered the dependent variable, and Y the independent variable, the equation of the least- 
squares line is X = b + b\ Y and the normal equations are 

£JT =b N +bt E Y 

E^=*oE Y + bi E Y 2 
Then from Table 13.6 the normal equations become 

8*o + 406] = 56 
40/) + 2566, = 364 

from which b = - or -0.50, and b x = | or 1.50. These values can also be obtained from 
b ^ (E Y 2 ) - (E y)(E ^) _ (56)(256) - (40)(364) = Q g() 

° ^E^-IE if (8)(256) - (40) 2 

b _ NJ:xY-(J: X)(E y) _ (8)(364) - (56)(40) ^ t g() 

' ^E^ 2 -(En 2 (8)(256)-(40) 2 

Thus the required equation of the least-squares line is X = b + b x Y, or X = -0.50 + 1.507. 

Note that by solving this equation for Y we obtain Y = \ + \X, or Y = 0.333 + 0.667X, which is 
not the same as the line obtained in part (a). 

13.12 For the height/weight data in Problem 13.10 use the statistical package SAS to plot the observed 
data points and the least-squares line on the same graph. 



In Fig. 13-11, the observed data values are shown as open circles, and the least-squares line is shown a 
a dashed line. 



Fig. 13-11 SAS plot of the data points from Table 13.5 and the least-sqm 
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13.13 (a) Show that the two least-squares lines obtained in Problem 13.11 intersect at point (X,Y). 

(b) Estimate the value of Y when X — 12. 

(c) Estimate the value of X when 7 = 3. 

SOLUTION 

Thus point (X, Y), called the centroid, is (7, 5). 

Point (7, 5) lies on line Y = 0.545 + 0.636X; or, more exactly, Y = ± + ^X, since 5 = ^ + ^ (7). Point 
(7, 5) lies on line X=-\ + \Y, since 7 = - £ + § (5). 

Another method 

The equations of the two lines are Y = j\ + j\X and X = —j + ^Y. Solving simultaneously, 
we find that X = 1 and Y = 5. Thus the lines intersect at point (7,5). 
Putting X = 12 into the regression line of Y (Problem 13.11), Y = 0.545 + 0.636(12) = 8.2. 
Putting 7 = 3 into the regression line of X (Problem 13.1 1), X = -0.50 + 1.50(3) = 4.0. 



13.14 Prove that a least-squares line always passes through the point (X, Y). 
SOLUTION 

Case 1 (X is the independent variable) 

The equation of the least-squares line is 

Y = a + a] X (34) 
A normal equation for the least-squares line is 

£ Y = a N + ai £X (35) 
Dividing both sides of equation (35) by TV gives 

Y = a + a] X (36) 
Subtracting equation (36) from equation (34), the least-squares line can be written 

Y-Y = ai (X-X) (37) 

which shows that the line passes through the point (X, Y). 

Case 2 (Y is the independent variable) 

Proceeding as in Case 1, but interchanging X and Y and replacing the constants a and a x with b and 
b u respectively, we find that the least-squares line can be written 

X-X=b l (Y-Y) (38) 

which indicates that the line passes through the point (X, Y). 

Note that lines (37) and (38) are not coincident, but intersect in (X, Y). 



13.15 (a) Considering X to be the independent variable, show that the equation of the least-squares 
line can be written 



where x = X - X and y = Y - Y. 
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(b) If X = 0, show that the least-squares line in part (a) can be written 

(c) Write the equation of the least-squares line corresponding to that in part (a) if Y is the 
independent variable. 

(d) Verify that the lines in parts (a) and (c) are not necessarily the same. 

SOLUTION 

(a) Equation (37) can be written y = a x x, where x = X - X and v — Y - Y. Also, from the simultaneous 
solution of the normal equations (18) we have 

a _ NJ:XY-(J: X)(£ Y) ^ NJ2(x + X)(y+Y)-lj:(x + X}lj:(y+Y)} 
NJ2X 2 -(J2X) 2 NJ2(x + X) 2 ~{J:(x + X)} 2 

_ NJ2(xy + xY + Xy + XY)~(j:x + NX)(j:y + NY) 

N £ (x 2 + 2xX + X 2 )-(J2x + NX) 2 
_ NJ2xy + NYJ2x + NXJ2y + N ^ f - x + NX){J2 y + NY) 
JV £ x 2 + 2NX J2 x + N 2 X 2 - (E * + NX) 2 
But £ x = £ (X - X) = and £ y = £ ( Y - Y) = 0; hence the above simplifies to 

_NJ2 xy + N 2 XY - N 2 XY _ J2 xy 
Ul ~ NJ2x 2 + N 2 X 2 -N 2 X 2 ~ £l? 

This can also be written 

J£xy J£x(Y-Y) J£xY-Y-£x Y: *Y 

1 E x 2 E x 2 " Ex 2 E x 2 

Thus the least-squares line is y = a x x; that is, 

>-<S& - -(%?> 

(A) If X = 0, x = X - X = X. Then from 

-(If) 

Another method 

The normal equations of the least-squares line Y = a + a{X are 

£F = a N + a,Y.X and J2 XY = a J2 X + a x J2 X 2 
If z = (J2 X)/N = 0, then ]T X = and the normal equations become 



= E_^F 

E^ 2 

(if)- 





J2 Y = a N and 


from which 


«o = ^=F 




TV 


Thus the required eqi 


lation of the least-squares 




F = a + a,X or 
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(c) By interchanging X and Y or x and y, we can show as in part (a) thai 



(J) From part (a), the least-squares line is 



From part (c). the least-squares line is 



(39) 



(If) 



Since in general ^ J^- 

E -* 2 E 

the least-squares lines (39) and (40) are different in general. Note, however, that they intersect at x = 
and y = [i.e., at the point (X,Y)l 

13.16 If X' = X + A and Y' = Y + B, where A and B are any constants, prove that 

= v E - (E jO(E y) = jv E ^ - (E *')(E = > 

AT EZ 2_ (EX) 2 V^Z' 2 -(E^') 2 



x' = X' - X 1 = (X + A) - (X + A) = X - X = x 
y' = Y' - ?' = (Y + B) - (Y + B) = Y - Y = y 



Then 



E xy = xj_ 

E E x' 2 



and the result follows from Problem 13.15. A similar result holds for b\. 

This result is useful, since it enables us to simplify calculations in obtaining the regression line b\ 
subtracting suitable constants from the variables Zand Y (see the second method of Problem 13.17). 

Note: The result does not hold if X' = c x X + A and Y' = c 2 Y + B unless c x = c 2 . 



13.17 Fit a least-squares line to the data of Problem 13.10 by using (a) X as the independent variable 
and (b) X as the dependent variable. 

SOLUTION 

First method 

(a) From Problem 13.15(a) the required line is 




where x = X — X and y= Y - Y. The work involved in computing the sums can be arranged as in 
Table 13.7. From the first two columns we find X = 802/12 = 66.8 and Y = 1850/12 = 154.2. The last 
column has been added for use in part (b). 
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Height X 


Weight F 


x = X - X 


y=Y-Y 


xy 


x 2 


y 2 


70 


155 


3.2 


0.8 


2.56 


10.24 


0.64 


63 


150 


-3.8 


-4.2 


15.96 


14.44 


17.64 


72 


180 


5.2 


25.8 


134.16 


27.04 


665.64 


60 


135 


-6.8 


-19.2 


130.56 


46.24 


368.64 


66 


156 


-0.8 


1.8 


-1.44 


0.64 


3.24 


70 


168 


3.2 


13.8 


44.16 


10.24 


190.44 


74 


178 


7.2 


23.8 


171.36 


51.84 


566.44 


65 


160 


-1.8 


5.8 


-10.44 


3.24 


33.64 


62 


132 


-4.8 


-22.2 


106.56 


23.04 


492.84 


67 


145 


0.2 


-9.2 


-1.84 


0.04 


84.64 


65 


139 


-1.8 


-15.2 


27.36 


3.24 


231.04 


68 


152 


1.2 


-2.2 


-2.64 


1.44 


4.84 


E X = 802 
X = 66.8 


E Y= 1850 
Y = 154.2 






E xy= 616.32 


E x 2 = 191.68 


E y 2 = 2659.68 



The required least-squares line is 

, = ^V = — - = 3 22* 

or Y- 154.2 = 3.22(^-66.8), which can be written F = ~i.22X — 60.9. This equation is called the 
regression line of Y on X and is used for estimating Y from given values of X. 
(b) If X is the dependent variable, the required line is 

/£M ^ 

VEW 2659.68 ' ' 

which can be written X - 66.8 = 0.232(F - 154.2), or X = 31.0 + 0.232F. This equation is called the 
regression line of X on Y and is used for estimating X from given values of F. 
Note that the method of Problem 13.11 can also be used if desired. 

Second method 

Using the result of Problem 13.16, we may subtract suitable constants from X and F. We choose to 
subtract 65 from Xand 150 from F. Then the results can be arranged as in Table 13.7. 

a ^ N E X'Y 1 - (E X')(£ Y') ^ (12)(708) - (22)(50) _ g ^ 
01 NY.X 12 - (E X 1 ) 2 (12)(232)-(22) 2 

b = NJ2X'Y'-(Z F')(E X') ^ (12)(708) - (50)(22) _ Q ^ 
' N E Y' 2 - (E n 2 (12)(2868) - (50) 2 

Since Z = 65 + 22/12 = 66.8 and F = 150 + 50/12 = 154.2, the regression equations are F - 154.2 = 
3.22(X-66.8) and X - 66.8 = 0.232(F - 154.2); that is F = 3.22X — 60.9 and X = 0.232F + 31.0, in 
agreement with the first method. 



13.18 Work Problem 13.17 using MINITAB. Plot the regression line of weight on height and the 
regression line of height on weight on the same set of axes. Show that the point (X, Y) satisfies 
both equations. The lines therefore intersect at (X, Y). 
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SOLUTION 



Scatterplot of weight vs height, weight vs height 




130 -I , , , , , , , , 

60 62 64 66 68 70 72 74 
Height 

Fig. 13-12 The regression line of weight on height and the regression line of height on weight both pass through the 
point (xbar, ybar). 

(X, Y) is the same as (xbar, ybar) and equals (66.83, 154.17). Note that weight = 3.22(66.83)- 
60.9 = 154.17 and height = 31.0 + 0.232(154.17) = 66.83. Therefore, both lines pass through (xbar, ybar). 



Table 13.8 



X' 


Y' 


x' 2 


X'Y' 


Y' 2 


5 


5 


25 


25 


25 


-2 





4 








7 


30 


49 


210 


900 


-5 


-15 


25 


75 


225 


1 


6 


1 


6 


36 


5 


18 


25 


90 


324 


9 


28 


81 


252 


784 





10 








100 


-3 


-18 


9 


54 


324 


2 


-5 


4 


-10 


25 





-11 








121 


3 




9 


6 


4 


£X' = 22 


£ Y' = 50 


E X' 2 = 232 


•£ X'Y' = 708 


•£ Y' 2 = 2868 



APPLICATIONS TO TIME SERIES 



13.19 The total agricultural exports in millions of dollars a 
Use MINITAB to do the following. 



; given in Table 13.9. 



Total value 
Coded year 



51246 53659 



53115 59364 



Source: The 2007 Statistical Abstract. 
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(a) Graph the data and show the least-squares regression line. 

(b) Find and plot the trend line for the data. 

(c) Give the fitted values and the residuals using the coded values for the years. 

(d) Estimate the value of total agricultural exports in the year 2006. 

SOLUTION 

(a) The data and the regression line are shown in Fig. 13-13 (a). The pull-down Stat — > Regression - 
Fitted line plot gives the graph shown in Fig. 13-13 (a). 




Fig. 13-13 (a) The regression line for total agricultural exports in millions of dollars. 



(b) The pull-down Stat — > Time series — > Trend Analysis gives the plot shown in Fig. 13-13 (b). It is a 
different way of looking at the same data. It may be a little easier to work with index numbers rather 
than years. 



Trend Analysis Plot for total value 
Linear Trend Model 
K= 48156.1 +2513.74*f 




Fig. 13-13 (b) The trend line for total agricultural exports in millions of dollars. 
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(c) Table 1 3 . 1 gives the fitted value 



; and the residuals for the data in Table 13.9 using coded values for the 



5 1 246 
53659 
53115 
59364 
61383 
62958 



50669.8 
53183.6 
55697.3 
58211.0 
60724.8 
63238.5 



576.19 
475.45 
-2582.30 
1152.96 
658.22 
-280.52 



(d) Using the coded value, the estimated value is F, = 48156.1 + 25 13.74(7) = 65752.3. 



13.20 Table 13.11 gives the purchasing power of the dollar as measured by consumer prices according to 
the U.S. Bureau of Labor Statistics, Survey of Current Business. 



Source: U.S. Bureau of Labor Statistics, Survey of Cur 



(a) Graph the data and obtain the trend line using MINITAB. 

(b) Find the equation of the trend line by hand. 

(c) Estimate the consumer price in 2008 assuming the trend continues for 3 r 



(a) The solid line in Fig. 13-14 shows a plot of the data in Table 13.11 and the dashed line shows the graph 
of the least-squares line. 

Trend Analysis Plot for Purchasing Power 
Linear Trend Model 
Y t = 0.5942 -0.0132*f 

Variable 
- Actual 

0.57- 
[ 0.56- 
: 0.55- 
0.54- 
0.53- 
0.52 - 
0.51 -. 



Fig. 13-14 Trend line for purchasing power. 
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(b) The computations for finding the trend line by hand are shown in Table 13.12. The equation is 
,. ».'■„ 

where x = X-X and y=Y-Y, which can be written as Y — 0.548 = — 0.0132(X — 3.5) or 
Y = -0.01 32X + 0.5942. The computational work saved by statistical software is tremendous as 
illustrated by this problem. 



0.581 
0.565 
0.556 
0.544 
0.530 
0.512 
Y= 3.288 
= 0.548 



-0.004 
-0.018 
-0.036 



6.25 
2.25 
0.25 
0.25 
2.25 
6.25 



(c) The estimation for 2008 is obtained by substituting t = 9 in the trend line equation. Estimated consumer 
price is 0.5942 - 0.0132(9) = 0.475. 



NONLINEAR EQUATIONS REDUCIBLE TO LINEAR FORM 

13.21 Table 13.13 gives experimental values of the pressure P of a given mass of gas corresponding to 
various values of the volume V. According to thermodynamic principles, a relationship having the 
form PV 1 — C, where 7 and C are constants, should exist between the variables. 

(a) Find the values of 7 and C. 

(b) Write the equation connecting P and V. 



Table 13.13 



Volume V in cubic inches (in 3 ) 


54.3 


61.8 


72.4 


88.7 


118.6 


194.0 


Pressure P in pounds per square inch (lb/in 2 ) 


61.2 


49.2 


37.6 


28.4 


19.2 


10.1 



(c) Estimate P when V = 100.0 in 3 . 



SOLUTION 

Since PV' 1 = C, we have 

log P + j log V = log C or log P = log C — 7 log V 
Calling log V = X and log P=Y, the last equation can be written 

Y = a + a t X (41) 

where a = log C and a x = -7. 
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Table 1 3 . 1 4 gives X = log V and Y = log P, corresponding to the values of V and P in Table 13.13, and 
also indicates the calculations involved in computing the least-squares line (41). The normal equations 
corresponding to the least-squares line (41) are 

J2 Y = a N + ai j:X and £ XY = a £ X + a x £ X 1 



a _ (E r)(E - (E gXE *r) _ 1 2Q = *E *r-(E *)(S y) = l40 

nj:x 2 -(zx) 2 ' 1 nj:x 2 -(j:x) 2 

Thus F= 4.20- 1.40X. 

(a) Since a = 4.20 = log C and ai = -1.40 = -7, C = 1.60 x 10 4 and 7 = 1.40. 

(b) The required equation in terms of P and Fcan be written PV L4 ° = 16,000. 

(c) When V = 100, X = log V = 2 and F = log P = 4.20 - 1.40(2) = 1.40. Then P = antilogl.4 
25.1 lb/in 2 . 



Table 13.14 



X — log V 


F = logP 


X 2 


XY 


1.7348 


1.7868 


3.0095 


3.0997 


1.7910 


1.6946 


3.2077 


3.0350 


1.8597 


1.5752 


3.4585 


2.9294 


1.9479 


1.4533 


3.7943 


2.8309 


2.0741 


1.2833 


4.3019 


2.6617 


2.2878 


1.0043 


5.2340 


2.2976 


E^ = 11.6953 


E F = 8.7975 


Y, X 1 = 23.0059 


£ JSfF = 16.8543 



13.22 Use MINITAB to assist in the solution of Problem 13.21. 
SOLUTION 

The transformations X=log,(V) and F=log,(P) converts the problem to a linear fit problem. The 
calculator in MINITAB is used to take common logarithms of volume and pressure. The following are in 
columns CI through C4 of the MINITAB worksheet. 



V 


P 


LoglOl/ 


Log10P 


54.3 


61.2 


1 .73480 


1 .78675 


61.8 


49.2 


1 .79099 


1.69197 


72.4 


37.6 


1 .85974 


1.57519 


88.7 


28.4 


1 .94792 


1 .45332 


118.6 


19.2 


2.07408 


1 .28330 


194.0 


10.1 


2.28780 


1 .00432 



The least-squares fit gives the following: log 10 (P) = 4.199- 1.402 log 10 (F). See Fig. 13-15. a = log C 
and a x = -7. Taking antilogs gives C = 10"" and 7 = -a, or C = 15812 and 7 = 1.402. The nonlinear 
equation is PV lM1 = 15812. 



340 



CURVE FITTING AND THE METHOD OF LEAST SQUARES 



[CHAP. 13 




13.23 Table 13.15 gives the population of the United States at 5-year intervals from 1960 to 2005 in 
millions. Fit a straight line as well as a parabola to the data and comment on the two fits. 
Use both models to predict the United States population in 2010. 



Table 13.15 



Year 


1960 


1965 


1970 


1975 


1980 


1985 


1990 


1995 


2000 


2005 


Population 


181 


194 


205 


216 


228 


238 


250 


267 


282 


297 



Source: U.S. Bureau of Census. 



A partial printout of the MINITAB solution for the least-squares line and least-squares parabola is given 



] 960 
1965 
1970 
1975 



1985 238 6 

1990 250 7 

1995 267 8 

2000 282 9 

2005 297 10 

The straight line model is a 
The regression equation is 

Population = 166 + 12 . 6 j 

The quadratic model is as f 
The regression equation is 
Population = 174 + 9.3 x 
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Table 13.16 gives the fitted values and residuals for the straight line fit to the data. 



Table 13.16 



Year 


Population 


Fitted value 


Residual 


1960 


181 


179.018 


1.98182 


1965 


194 


191.636 


2.36364 


1970 


205 


204.255 


0.74545 


1975 


216 


216.873 


-0.87273 


1980 


228 


229.491 


-1.49091 


1985 


238 


242.109 


-4.10909 


1990 


250 


254.727 


-4.72727 


1995 


267 


267.345 


-0.34545 


2000 


282 


279.964 


2.03636 


2005 


297 


292.582 


4.41818 



Table 13.17 gives the fitted values and residuals for the parabolic fit to the data. The sum of squares of 
the residuals for the straight line is 76.073 and the sum of squares of the residuals for the parabola is 20.042. 
It appears that, overall, the parabola fits the data better than the straight line. 



Table 13.17 



Year 


Population 


Fitted value 


Residual 


1960 


181 


182.927 


-1.92727 


1965 


194 


192.939 


1.06061 


1970 


205 


203.603 


1.39697 


1975 


216 


214.918 


1.08182 


1980 


228 


226.885 


1.11515 


1985 


238 


239.503 


-1.50303 


1990 


250 


252.773 


-2.77273 


1995 


267 


266.694 


0.30606 


2000 


282 


281.267 


0.73333 


2005 


297 


296.491 


0.50909 



To predict the population in the year 2010, note that the coded value for 2010 is 11. The straight line 
predicted value is population= 166+ 12.6x= 166+ 138.6 = 304.6 million and the parabola model predicts 
population = 174 + 9.03 x + 0.326x 2 = 174 + 99.33 + 39.446 = 312.776. 
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STRAIGHT LINES 

13.24 If 3Z + 2Y = 18, find (a) Xwhen Y = 3, (b) Fwhen X = 2, (c) Xwhen F = -5, (d) Fwhen X = -1, (e) the 
X intercept, and (f) the F intercept. 

13.25 Construct a graph of the equations (a) F = 3Z — 5 and (b) X + 2 Y = 4 on the same set of axes. In what 
point do the graphs intersect? 

13.26 (a) Find an equation for the straight line passing through the points (3, -2) and (-1, 6). 

(b) Determine the X and F intercepts of the line in part (a). 

(c) Find the value of F corresponding to X = 3 and to X = 5. 

(d) Verify your answers to parts (a), (ft), and (c) directly from a graph. 

13.27 Find an equation for the straight line whose slope is \ and whose F intercept is -3. 

13.28 (a) Find the slope and F intercept of the line whose equation is 3X - 5 F = 20. 

(b) What is the equation of a line which is parallel to the line in part (a) and which passes through the 
point (2, - 1)? 

13.29 Find (a) the slope, (b) the F intercept, and (c) the equation of the line passing through the points (5, 4) 
and (2, 8). 

13.30 Find the equation of a straight line whose X and F intercepts are 3 and —5, respectively. 

13.31 A temperature of 100 degrees Celsius (°C) corresponds to 212 degrees Fahrenheit (°F), while a temperature 
of 0°C corresponds to 32 °F. Assuming that a linear relationship exists between Celsius and Fahrenheit 
temperatures, find (a) the equation connecting Celsius and Fahrenheit temperatures, (b) the Fahrenheit 
temperature corresponding to 80 °C, and (c) the Celsius temperature corresponding to 68 °F. 



THE LEAST-SQUARES LINE 

13.32 Fit a least-squares line to the data in Table 13.18, using (a) X as the independent variable and (b) X as 
dependent variable. Graph the data and the least-squares lines, using the same set of coordinate axes. 



13.33 For the data of Problem 13.32, find (a) the values of F when X = 5 and X = 12 and (b) the value of X 
when F = 7. 

13.34 (a) Use the freehand method to obtain an equation for a line fitting the data of Problem 13.32. 
(b) Using the result of part (a), answer Problem 13.33. 

13.35 Table 13.19 shows the final grades in algebra and physics obtained by 10 students selected at random from a 
large group of students. 

(a) Graph the data. 

(b) Find a least-squares line fitting the data, using X as the independent variable. 
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(c) Find a least-squares line fitting the data, using Y as the independent variable. 

(d) If a student receives a grade of 75 in algebra, what is her expected grade in physics? 

(e) If a student receives a grade of 95 in physics, what is her expected grade in algebra? 



Table 13.19 



Algebra (X) 


75 


80 


93 


65 


87 


1 98 68 


84 77 


Physics (Y) 


82 


78 


86 


72 


91 S 


95 72 


89 74 



13.36 Table 13.20 shows the birth rate per 1000 population during the years 1998 through 2004. 

(a) Graph the data. 

(b) Find the least-squares line fitting the data. Code the years 1998 through 2004 as the whole numbers 1 
through 7. 

(c) Compute the trend values (fitted values) and the residuals. 

(d) Predict the birth rate in 2010, assuming the present trend continues. 



Table 13.20 



Year 


1998 1999 2000 2001 2002 2003 2004 


Birth rate per 1000 


14.3 14.2 14.4 14.1 13.9 14.1 14.0 



Source: U.S. National Center for Health Statistics. Vital Statistics of the United States, annual: 
National Vital Statistics Reports and unpublished data. 



13.37 Table 13.21 shows the number in thousands of the United States population 85 years and over for the years 
1999 through 2005. 

(a) Graph the data. 

(b) Find the least-squares line fitting the data. Code the years 1999 through 2005 as the whole numbers 1 
through 7. 

(c) Compute the trend values (fitted values) and the residuals. 

(d) Predict the number of individuals 85 years and older in 2010, assuming the present trend continues. 

Table 13.21 

Year 1999 2000 2001 2002 2003 2004 2005 

85 and over 4154 4240 4418 4547 4716 4867 5096 
Source: U.S. Bureau of Census. 

LEAST-SQUARES CURVES 

13.38 Fit a least-squares parabola, Y = a + a^X + a 2 X z , to the data in Table 13.22. 



Table 13.22 



344 



CURVE FITTING AND THE METHOD OF LEAST SQUARES 



[CHAP. 13 



13.39 The total time required to bring an automobile to a stop after one perceives danger is the reaction time (the 
lime between recognizing danger and applying the brakes) plus the braking time (the lime for stopping after 
applying the brakes). Table 13.23 gives the stopping distance D (in feet, of ft) of an automobile traveling at 
speeds V (in miles per hour, or mi/h) from the instant that danger is perceived. 



(a) Graph D against V. 

(b) Fit a least-squares parabola of the form D = a 

(c) Estimate D when V = 45 mi/h and 80 mi/h. 



+ aiV + a 2 V 2 to the data. 



Speed V (mi/h) 


20 30 40 50 60 70 


Stopping distance D (ft) 


54 90 138 206 292 396 



Table 13.24 shows the male and female populations of the United States during the years 1940 through 2005 
in millions. It also shows the years coded and the differences which equals male minus female. 

(a) Graph the data points and the linear least-squares best fit. 

(b) Graph the data points and the quadratic least-squares best fit. 

(c) Graph the data points and the cubic least-squares best fit. 

(d) Give the fitted values and the residuals using each of the three models. Give the sum of squares for 



residuals for all three models. 
Use each of the three models t 



predict the difference ii 
Table 13.24 



2010. 



Year 



Coded 
Male 
Female 
Difference 



Source: U.S. Bureau of Census. 



13.41 Work Problem 13.40 using the ratio of females to males instead of differences. 

13.42 Work Problem 13.40 by fitting a least squares parabola to the differences. 

13.43 The number Y of bacteria per unit volume present in a culture after X hours is given in Table 13.25. 

Table 13.25 



Number of hours (X) 


1 2 3 4 5 6 


Number of bacteria per unit volume (Y) 


32 47 65 92 132 190 275 



(a) Graph the data on semilog graph paper, using the logarithmic scale for Y and the arithmetic scale for X. 

(b) Fit a least-squares curve of the form Y = ab x to the data and explain why this particular equation 
should yield good results. 

(c) Compare the values of Y obtained from this equation with the actual values. 

(d) Estimate the value of Y when X = 7. 



13.44 In Problem 13.43, show how a graph on semilog graph paper c; 
without employing the method of least squares. 



i be used to obtain the required equation 



Correlation Theory 



CORRELATION AND REGRESSION 

In Chapter 13 we considered the problem of regression, or estimation, of one variable (the dependent 
variable) from one or more related variables (the independent variables). In this chapter we consider the 
closely related problem of correlation, or the degree of relationship between variables, which seeks to 
determine how well a linear or other equation describes or explains the relationship between variables. 

If all values of the variables satisfy an equation exactly, we say that the variables are perfectly 
correlated or that there is perfect correlation between them. Thus the circumferences C and radii r 
of all circles are perfectly correlated since C = 2irr. If two dice are tossed simultaneously 100 times, 
there is no relationship between corresponding points on each die (unless the dice are loaded); 
that is, they are uncorrelated. Such variables as the height and weight of individuals would show some 
correlation. 

When only two variables are involved, we speak of simple correlation and simple regression. When 
more than two variables are involved, we speak of multiple correlation and multiple regression. 
This chapter considers only simple correlation. Multiple correlation and regression are considered 
in Chapter 15. 



LINEAR CORRELATION 

If X and Y denote the two variables under consideration, a scatter diagram shows the location of 
points (X, Y) on a rectangular coordinate system. If all points in this scatter diagram seem to lie near a 
line, as in Figs. 14-l(a) and 14-1(Z>), the correlation is called linear. In such cases, as we have seen in 
Chapter 13, a linear equation is appropriate for purposes of regression (or estimation). 

If Y tends to increase as X increases, as in Fig. 14-l(a), the correlation is called positive, or direct, 
correlation. If Y tends to decrease as X increases, as in Fig. 14-1(6), the correlation is called negative, or 
inverse, correlation. 

If all points seem to lie near some curve, the correlation is called nonlinear, and a nonlinear equation 
is appropriate for regression, as we have seen in Chapter 13. It is clear that nonlinear correlation can be 
sometimes positive and sometimes negative. 

If there is no relationship indicated between the variables, as in Fig. 14-l(c), we say that there is 
no correlation between them (i.e., they are uncorrelated). 



345 

Copyright © 2008, 1999, 1988, 1961 by The McGraw-Hill Companies, Inc. Click here for terms of use. 



346 



CORRELATION THEORY 



[CHAP. 14 




Fig. 14-1 Examples of positive correlation, negative correlation and no correlation, (a) Beginning salary and 
years of formal education are positively correlated; (b) Grade point average (GPA) and hours spent 
watching TV are negatively correlated; (c) There is no correlation between hours on a cell phone 
and letters in last name. 



MEASURES OF CORRELATION 

We can determine in a qualitative manner how well a given line or curve describes the relation- 
ship between variables by direct observation of the scatter diagram itself. For example, it is seen 
that a straight line is far more helpful in describing the relation between X and Y for the data of 
Fig. 14- 1(a) than for the data of Fig. 14-1(6) because of the fact that there is less scattering about the 
line of Fig. 14- 1(a). 

If we are to deal with the problem of scattering of sample data about lines or curves in a quantitative 
manner, it will be necessary for us to devise measures of correlation 



THE LEAST-SQUARES REGRESSION LINES 

We first consider the problem of how well a straight line explains the relationship between two 
variables. To do this, we shall need the equations for the least-squares regression lines obtained in 
Chapter 13. As we have seen, the least-squares regression line of Y on X is 



Y = a Q + a { X 



(1) 
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where a and a\ are obtained from 
which yield 



which yield 

bo 



the normal equations 

E Y = a N + ai J2X 
J2XY=a J2X + ai J2X 2 



J2X =b N +b l J2Y 

(E gXE y 2 ) - (E y)(E 
n E ^ 2 - (E ^) 2 

^E^y-(E ^)(E y) 

# E y 2 - (E if 



(2) 

(3) 
{4) 

{5) 
(6) 



a = (E y)(E ^ 2 ) - (E *)(E 
iv E ^ 2 - (E x) 2 

a = N j:xy-(j: x)(e n 

n E ^ 2 - (E ^) 2 

Similarly, the regression line of X on F is given by 

X = b + b x Y 

where b and b\ are obtained from the normal equations 



Equations (1) and (4) can also be written, respectively, as 

'-(gf> md *-(§7> (7) 

where x = X - X and j = Y - Y. 

The regression equations are identical if and only if all points of the scatter diagram lie on a line. 
In such case there is perfect linear correlation between X and Y. 



STANDARD ERROR OF ESTIMATE 

If we let T est represent the value of Y for given values of X as estimated from equation (7), a measure 
of the scatter about the regression line of Y on X is supplied by the quantity 



which is called the standard error of estimate of Y on X. 
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If the regression line (4) is used, an analogous standard error of estimate of X on Y is denned by 



In general, s yjr ^ s X Y . 

Equation (8) can be written 



2 _ e y 2 -^o E y-fli s^y 

•^F.X — 77 



<"» 

which may be more suitable for computation (see Problem 14.3). A similar expression exists for 
equation (9). 

The standard error of estimate has properties analogous to those of the standard deviation. 
For example, if we construct lines parallel to the regression line of Y on X at respective vertical distances 
s Y .x, 2s Y .x, an d 3s Y .x fr° m it, we should find, if N is large enough, that there would be included 
between these lines about 68%, 95%, and 99.7% of the sample points. 

Just as a modified standard deviation given by 



was found useful for small samples, so a modified standard error of estimate given by 



is useful. For this reason, some statisticians prefer to define equation (8) or (9) with TV - 2 replacing 
N in the denominator. 



EXPLAINED AND UNEXPLAINED VARIATION 

The total variation of Y is defined as E {Y - Y) 2 : that is, the sum of the squares of the deviations 
of the values of Y from the mean Y. As shown in Problem 14.7, this can be written 

E(Y~ Y) 2 = J2(Y-Y esl ) 2 + j:(Y est -Y) 2 (77) 

The first term on the right of equation (77) is called the unexplained variation, while the second term 
is called the explained variation — so called because the deviations F esl - Y have a definite pattern, while 
the deviations Y - F est behave in a random or unpredictable manner. Similar results hold for the 
variable X. 



COEFFICIENT OF CORRELATION 

The ratio of the explained variation to the total variation is called the coefficient of determination. 
If there is zero explained variation (i.e., the total variation is all unexplained), this ratio is 0. If there is 
zero unexplained variation (i.e., the total variation is all explained), the ratio is 1. In other cases the ratio 
lies between and 1. Since the ratio is always nonnegative, we denote it by r 2 . The quantity r, called the 

coefficient of correlation (or briefly correlation coefficient), is given by 

r = ± / explained variation = ± / E ( Y esi - ff 
r V total variation \] J2( Y ~Y) 2 
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and varies between -1 and +1. The + and - signs are used for positive linear correlation and negative 
linear correlation, respectively. Note that r is a dimensionless quantity; that is, it does not depend on the 
units employed. 

By using equations (8) and (77) and the fact that the standard deviation of Y is 
we find that equation (72) can be written, disregarding the sign, as 

r=^~if 01 " Sy* = S Y Vl^? (14) 

Similar equations exist when X and Y are interchanged. 

For the case of linear correlation, the quantity r is the same regardless of whether X or 7 is 
considered the independent variable. Thus r is a very good measure of the linear correlation between 
two variables. 



REMARKS CONCERNING THE CORRELATION COEFFICIENT 

The definitions of the correlation coefficient in equations (72) and (14) are quite general and can be 
used for nonlinear relationships as well as for linear ones, the only differences being that F est is computed 
from a nonlinear regression equation in place of a linear equation and that the + and - signs are 
omitted. In such case equation (8), defining the standard error of estimate, is perfectly general. Equation 
(70), however, which applies to linear regression only, must be modified. If, for example, the estimating 
equation is 

Y = a + a x X + a 2 X 2 + ■■■ + a n _ x X n ~ l (15) 
then equation (70) is replaced by 

2 E y 2 ^oE Y-c h exf v.Sr-'r 

s y.x = ^ (16) 

In such case the modified standard error of estimate (discussed earlier in this chapter) is 



where the quantity N — n is called the number of degrees of freedom. 

It must be emphasized that in every case the computed value of r measures the degree of the 
relationship relative to the type of equation that is actually assumed. Thus if a linear equation is assumed 
and equation (72) or (14) yields a value of r near zero, it means that there is almost no linear correlation 
between the variables. However, it does not mean that there is no correlation at all, since there may 
actually be a high nonlinear correlation between the variables. In other words, the correlation coefficient 
measures the goodness of fit between (1) the equation actually assumed and (2) the data. Unless other- 
wise specified, the term correlation coefficient is used to mean linear correlation coefficient. 

It should also be pointed out that a high correlation coefficient (i.e., near 1 or -1) does not 
necessarily indicate a direct dependence of the variables. Thus there may be a high correlation between 
the number of books published each year and the number of thunderstorms each year. Such examples 
are sometimes referred to as nonsense, or spurious, correlations. 
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PRODUCT-MOMENT FORMULA FOR THE LINEAR CORRELATION COEFFICIENT 

If a linear relationship between two variables is assumed, equation (72) becomes 

r = E x y 
V(E* 2 )(E y 1 ) 

where x = X — X and y — Y — Y (see Problem 14.10). This formula, which automatically gives the 
proper sign of r, is called the product-moment formula and clearly shows the symmetry between Zand Y. 
If we write 



(77) 



_ E 



then s x and s Y will be recognized as the standard deviations of the variables X and Y, respectively, while 
s x and s\ are their variances. The new quantity s XY is called the covariance of X and Y. In terms of the 
symbols of formulas (18), formula (77) can be written 

r = ^- (79) 
s x s Y 

Note that r is not only independent of the choice of units of X and Y, but is also independent of the 
choice of origin. 



SHORT COMPUTATIONAL FORMULAS 

Formula (77) can be written in the equivalent form 

n j: xy - (j: x)(j: y) 



\A«E* 2 -(E ^l^E^-tE y?\ 

which is often used in computing r. 

For data grouped as in a bivariate frequency table, or bivariate frequency distribution (see Problem 
14.17), it is convenient to use a coding method, as in previous chapters. In such case, formula (20) can be 
written 



N E fi* x u Y - (E fxu x ) (E fruy) 

\/W YMu\ - (EfxU X ) 2 ][N J2f Y u 2 Y - (Efruy) 2 } 

(see Problem 14.18). For convenience in calculations using this formula, a correlation table i: 
(see Problem 14.19). 

For grouped data, formulas (18) can be written 

. ■ , \TJuxUy (TJxuA (YJjUyX\ 



(21) 



(22) 



(24) 

where c x and c Y are the class-interval widths (assumed constant) corresponding to the variables X and 
Y, respectively. Note that (23) and (24) are equivalent to formula (77) of Chapter 4. 
Formula (79) is seen to be equivalent to (27) if results (22) to (24) are used. 



EM ( TJxuA 

N \ N ) 
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REGRESSION LINES AND THE LINEAR CORRELATION COEFFICIENT 

The equation of the least-squares line Y= a + a\X, the regression line of Y on X, can be written 

Y-f = '^(X-X) or y = r -^x (25) 
*x % 

Similarly, the regression line of X on Y, X = b + b\ Y, can be written 

X -X = r ^(Y-Y) or x = ^j (26) 
s Y s Y 

The slopes of the lines in equations (25) and (26) are equal if and only if r — ± 1 . In such case the two 
lines are identical and there is perfect linear correlation between the variables X and Y. If r = 0, the lines 
are at right angles and there is no linear correlation between X and Y. Thus the linear correlation 
coefficient measures the departure of the two regression lines. 

Note that if equations (25) and (26) are written Y= a + a x X and X=b + b\Y, respectively, then 
a x b x = r 2 (see Problem 14.22). 



CORRELATION OF TIME SERIES 

If each of the variables X and Y depends on time, it is possible that a relationship may exist between 
X and Y even though such relationship is not necessarily one of direct dependence and may produce 
"nonsense correlation." The correlation coefficient is obtained simply by considering the pairs of values 
(X, Y) corresponding to the various times and proceeding as usual, making use of the above formulas 
(see Problem 14.28). 

It is possible to attempt to correlate values of a variable X at certain times with corresponding values 
of X at earlier times. Such correlation is often called autocorrelation. 



CORRELATION OF ATTRIBUTES 

The methods described in this chapter do not enable us to consider the correlation of variables 
that are nonnumerical by nature, such as the attributes of individuals (e.g., hair color, eye color, etc.). 
For a discussion of the correlation of attributes, see Chapter 12. 



SAMPLING THEORY OF CORRELATION 

The N pairs of values (X, Y) of two variables can be thought of as samples from a population of all 
such pairs that are possible. Since two variables are involved, this is called a bivariate population, which 
we assume to be a hirariatc normal distribution. 

We can think of a theoretical population coefficient of correlation, denoted by p, which is estimated 
by the sample correlation coefficient r. Tests of significance or hypotheses concerning various values of p 
require knowledge of the sampling distribution of r. For p = this distribution is symmetrical, and a 
statistic involving Student's distribution can be used. For p ^ 0, the distribution is skewed; in such case a 
transformation developed by Fisher produces a statistic that is approximately normally distributed. 
The following tests summarize the procedures involved: 

1. Test of Hypothesis p — 0. Here we use the fact that the statistic 




has Student's distribution with v = N - 2 degrees of freedom (see Problems 14.31 and 14.32). 
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2. Test of Hypothesis p = p ^ 0. Here we use the fact that the statistic 

Z = I log, (j±^j = 1.1513 logio(^) (^) 

where e = 2.71828 ... , is approximately normally distributed with mean and standard deviation 
given by 

Mz = W}^) =1.15131og 10 (i±£) (29) 
V-Poj V-PoJ VN-3 

Equations (28) and (29) can also be used to find confidence limits for correlation coefficients 
(see Problems 14.33 and 14.34). Equation (28) is called Fisher's Z transformation. 

3. Significance of a Difference between Correlation Coefficients. To determine whether two correlation 
coefficients, r l and r 2 , drawn from samples of sizes N { and N 2 , respectively, differ significantly from 
each other, we compute Z { and Z 2 corresponding to r x and r 2 by using equation (28). We then use the 
fact that the test statistic 

z ^ Zl -z 2 -p Z] _ Zi {3Q) 

<J z l -z 2 

where Mz, -z 2 = Mz, - A*z 2 



J N { - 3 N 2 - 

is normally distributed (see Problem 14.35). 



SAMPLfNG THEORY OF REGRESSION 

The regression equation Y= a + a { X is obtained on the basis of sample data. We are often 
interested in the corresponding regression equation for the population from which the sample was 
drawn. The following are three tests concerning such a population: 

1. Test of Hypothesis a x = A x . To test the hypothesis that the regression coefficient a x is equal to some 
specified value A { , we use the fact that the statistic 



s y.xI s x 

has Student's distribution with N — 2 degrees of freedom. This can also be used to find confidence 
intervals for population regression coefficients from sample values (see Problems 14.36 and 14.37). 

2. Test of Hypothesis for Predicted Values. Let Y denote the predicted value of Y corresponding to 
X = X as estimated from the sample regression equation (i.e., Y = a + a { X ). Let Y p denote the 
predicted value of Y corresponding to X = X Q for the population. Then the statistic 

* = Y °~ Yp — VN^2 = Y °- Yp (32) 

Sy.x\]N+ 1 + (X - Xfjs\ Sx.yyjl + l/N + (X Q - X) 2 /(Ns\) 

has Student's distribution with N -2 degrees of freedom. From this, confidence limits for predicted 
population values can be found (see Problem 14.38). 
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. Test of Hypothesis for Predicted Mean Values. Let Y denote the predicted value of Y corresponding 
to X — X as estimated from the sample regression equation (i.e., Y = a + a Y X ). Let Y p denote the 
predicted mean value of Y corresponding to X = X for the population. Then the statistic 

t = . Y °- ? > VF^2 = Y °- ? > _ (33) 

sy.x^ + (X - Xf/sl h.x\l^/N + (X - xf/iNsl) 

has Student's distribution with N — 2 degrees of freedom. From this, confidence limits for predicted 
mean population values can be found (see Problem 14.39). 



Solved Problems 

SCATTER DIAGRAMS AND REGRESSION LINES 

14.1 Table 14.1 shows [in inches (in)] the respective heights Xand Yof a sample of 12 fathers and their 
oldest son. 

(a) Construct a scatter diagram of the data. 

(b) Find the least-squares regression line of the height of the father on the height of the son by 
solving the normal equations and by using SPSS. 

(c) Find the least-squares regression line of the height of the son on the height of the father by 
solving the normal equations and by using STAT1STIX. 



Table 14.1 



Height X of father (in) 


65 63 67 64 68 62 70 66 68 67 69 71 


Height Y of son (in) 


68 66 68 65 69 66 68 65 71 67 68 70 



SOLUTION 

(a) The scatter diagram is obtained by plotting the points (X, Y) on a rectangular coordinate system, 
shown in Fig. 14-2. 

71 j • 

70- • 

g 69- • 




62 63 64 65 66 67 68 69 70 71 
Height of father 
Fig. 14-2 Scatter diagram of the data in Table 14.1. 

(b) The regression line of Y on X is given by Y=a + a l X, where a and a x are obtained by solving the 
normal equations. 

E Y =a Q N + a x £ X 
J2XY = a J2X + a l £ X 2 
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The sums are shown in Table 14.2, from which the normal equations become 

12a + 800a,= 811 
800a + 53418a, = 54107 

from which we find that a = 35.82 and a, =0.476, and thus Y= 35.82 + 0.476X. 

The SPSS pull-down Analyze — > Regression — ► Linear gives the following partial output: 



Coefficients 3 





Unstandardized 


Standardized 








Coefficients 


Coefficients 






Model 


B 


Std. Error 


Beta 




Sig. 


1 (Constant) 


35.825 


10.178 




3.520 


.006 


Htfather 


.476 


.153 


.703 


3.123 


.011 



"Dependent Variable: Htson. 



Opposite the word (Constant) is the value of a and opposite Htfather is the value of a L 



Table 14.2 



X 


Y 


X 2 


XY 


Y 2 


65 


68 


4225 


4420 


4624 


63 


66 


3969 


4158 


4356 


67 


68 


4489 


4556 


4624 


64 


65 


4096 


4160 


4225 


68 


69 


4624 


4692 


4761 


62 


66 


3844 


4092 


4356 


70 


68 


4900 


4760 


4624 


66 


65 


4356 


4290 


4225 


68 


71 


4624 


4828 


5041 


67 


67 


4489 


4489 


4489 


69 


68 


4761 


4692 


4624 


71 


70 


5041 


4970 


4900 


£ X = 800 


£ y = 811 


Y, X 2 = 53,418 


Y, XY = 54,107 


J2 Y 2 = 54,849 



(c) The regression line of X on Y is given by X=b Q + b\Y, where b and b\ are obtained by solving the 
normal equations 

E X =b N +b x Y, y 
£ XY = b J2 Y + b x J2 Y 2 
Using the sums in Table 14.2, these become 

12ft + 811/),= 800 
8116 + 548496, = 54107 
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from which we find that b ■■ 



8 and b x = 1.036, and thus X = -3.38 + 1.036T 



The STATISTIX pull down Statistics -» Linear models -» Linear regression gives the 
following partial output: 



Statistix 8.0 

Unweighted Least Squares Linear Regressioi 
Predictor 

Variable Coefficient Std Error 

Constant -3.37687 22.4377 

Htson 1.03640 0.33188 



of Htfather 



-0.15 
3 .12 



0.8834 
0.0108 



Opposite the word constant is the value of /> = —3.37687 and opposite Htson is the value of 
bi = 1.0364. 



14.2 Work Problem 14.1 using M1N1TAB. Construct tables giving the fitted values, 7 est 
residuals. Find the sum of squares for the residuals for both regression lines. 



The least-squares regression line of Y on X will be found first. A part of the M1N1TAB output is shown 
below. Table 14.3 gives the fitted values, the residuals, and the squares of the residuals for the regression line 
of Y on X. 





Residual 

Y - T cst 



66.79 
65.84 
67.74 
66.31 
68.22 
65.36 
69.17 
67.27 
68.22 
67.74 
68.69 
69.65 



MTB > Regress 'Y' on 1 predictor 'X' 
Regression Analysis 

The regression equation isY = 35.8 + 0.476X 

The Minitab output for finding the least-squares regression line of X on Y is as follows: 

MTB > Regress 'X' on lpredictor 'Y' 

Regression Analysis 

The regression equation is X = -3 . 4 + 1 . 04 Y 

Table 14.4 gives the fitted values, the residuals, and the squares of the residuals for the regression line 
of X on Y. 
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Table 14.4 







Fitted Value 


Residual 


Residual 


x 


y 


^cst 


x x est 


Squared 


65 


68 


67.10 


-2.10 


4.40 


63 


66 


65.03 


-2.03 


4.10 


67 


68 


67.10 


-0.10 


0.01 


64 


65 


63.99 


0.01 


0.00 


68 


69 


68.13 


-0.13 


0.02 


62 


66 


65.03 


-3.03 


9.15 


70 


68 


67.10 


2.90 


8.42 


66 


65 


63.99 


2.01 


4.04 


68 


71 


70.21 


-2.21 


4.87 


67 


67 


66.06 


0.94 


0.88 


69 


68 


67.10 


1.90 


3.62 


71 


70 


69.17 


1.83 


3.34 








Sum = 


Sum = 42.85 



The comparison of the sums of squares of residuals indicates that the fit for the least-squares regression 
line of Y on X is much better than the fit for the least-squares regression line of X on Y. Recall that the 
smaller the sums of squares of residuals, the better the regression model fits the data. The height of the father 
is a better predictor of the height of the son than the height of the son is of the height of the father. 



STANDARD ERROR OF ESTIMATE 

14.3 If the regression line of Y on X is given by Y = a + a x X, prove that the standard error of 
estimate s y x is given by 

2 _ Ey 2 -«,Ei , -"iE^ 
Syx ~ N 

SOLUTION 

The values of Y as estimated from the regression line are given by F cst = a + a x X . Thus 

2 _ E(^-rcs t ) 2 _ E(y-"o-flijQ 2 

Syx ~ N - N 

= S Y(Y - a - a,X) - a £ (T - a - - a, £ X( Y - a - a x X) 
N 

But Y.(Y-a a -a x X) = Y. Y - a N - a, £X = 

and £ X(Y - a - a x X) = £ XY - a J2 X - a, £ X 2 = 

since from the normal equations 

£ Y = a N +a,j:X 
J2XY = a J2X + a, El 2 

Thus s\, x =^ Y{Y - a ' z) = £ yl - fl ° y - ^ XY 

This result can be extended to nonlinear regression equations. 



CHAP. 14] CORRELATION THEORY 357 

14.4 If x = X - X and y = Y - Y, show that the result of Problem 14.3 can be written 

2 _ E y 1 - «i E xy 

~ N 

SOLUTION 

From Problem 14.3, with X = x + X and Y = y+Y, we have 

JV4j- = E ^ 2 -«oE Y-a t J2XY = J2(y+Y) 2 -a j:(y+Y)-a l J2(x + X)(y+Y) 
= E(/ + 2jf + F 2 )-flb(E> + ^?)-ai £ to + + + 
= £ / + 2T E >' + - a NY - a, £ xy - a x X £ > - a x Y £ x - «,iVlF 

= E y 2 + ^? 2 - «o^? - «i E - qi^t 
= E/-«i E^ + w(?-^-^) 

where we have used the results £ x = 0, £ ^ = 0, and F = a + a x X (which follows on dividing both sides 
of the normal equation £ F = a o N + a \ E X by N). 



14.5 Compute the standard error of estimate, s YX , for the c 
definition and (b) the result of Problem 14.4. 



i of Problem 14.1 by using (a) the 



SOLUTION 

(a) From Problem 14.1(A) the regression line of Y on X is Y = 35.82 + 0.476Z. Table 14.5 lists the actual 
values of Y (from Table 14.1) and the estimated values of Y, denoted by F cst , as obtained from the 
regression line; for example, corresponding to X — 65 we have F cst = 35.82 + 0.476(65) = 66.76. 
Also listed are the values F - F cst , which are needed in computing s YX : 

2 _ E (Y - F cst ) _ (1.24) 2 + (0.19) 2 + ■ ■ ■ + (0.38) 2 _ 

N " 12 - LMi 

and s y . x = VY1642= 1.28 in. 

(b) From Problems 14.1, 14.2, and 14.4 



2 E/-«iE xy 38.92 - 0.476(40.34) , r 
s Y .x = = 12 =1-6 

and s Y .x = \/F643 = 1.28 in. 



66.76 
1.24 



65.81 
0.19 



67.71 
0.29 



68.19 
0.81 



65.33 
0.67 



69.14 67.24 
-1.14 -2.24 



69.62 
0.38 



14.6 (a) Construct two lines which are parallel to the regression line of Problem 14.1 and which are 
at a vertical distance S YX from it. 
(b) Determine the percentage of data points falling between these two lines. 
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SOLUTION 



71 

70 
69 
68 
67 
66 
65 
64 

62 63 64 65 66 67 68 69 70 71 
X 

Fig. 14-3 Sixty-six percent of the data points are within Sy.x of the regression line. 

The regression line Y= 35.82 + 0.476X, as obtained in Problem 14.1, is shown as the line with the 
squares on it. It is the middle of the three lines in Fig. 14-3. There are two other lines shown in Fig. 14-3. 
They are at a vertical distance S Y .x= 1-28 from the regression line. They are called the lower and 
upper lines. 

The twelve data points are shown as black circles in Fig. 14-3. Eight of twelve or 66.7% of the data 
points are between the upper and lower lines. Two data points are outside the lines and two are on the 

EXPLAINED AND UNEXPLAINED VARIATION 

14.7 Prove that £) ( Y - Y) 2 = £) ( Y - 7 esl ) 2 + £) ( F est - ff. 
SOLUTION 

Squaring both sides of Y - f = {Y - F est ) + (F est - F) and then summing, we have 

E (Y - ff = £ (F - F cst ) 2 + £ (F cst - ff + 2 £ (F - F cst )(F cst - F) 

The required result follows at once if we can show that the last sum is zero; in the case of linear regression, 
this is so since 

J2(Y-Y cst )(Y cst -Y) = j:(Y-a -a ] X)(a + a i X- F) 

= a J2(Y-a -a l X) + a l £ X( Y - a - a x X) - Y £ ( F - a - a,X) = 
because of the normal equations £ ( F - a - a t X) = and £ X( Y - a - a t X) = 0. 

The result can similarly be shown valid for nonlinear regression by using a least-squares curve given by 
F cst = a + a, X + a 2 X 2 + ■■■+ a„X" ■ 

14.8 Compute (a) the total variation, (b) the unexplained variation, and (c) the explained variation for 
the data in Problem 14.1. 

SOLUTION 

The least-squares regression line is F est = 35.8 + 0.476Z. From Table 14.6, we see that the total variation 
= £(F- f f = 38.917, the unexplained variation = £ (F - F cst ) 2 = 19.703, and the explained variation 
= £(I^t- Y) 2 = 19.214. 




Table 14.6 



Y 


y 

est 


Y 2 


y 2 


( Y cst - Y) 2 


68 


66.7894 


0.1739 


1.46562 


0.62985 


66 


65.8366 


2.5059 


0.02669 


3.04986 


68 


67.7421 


0.1739 


0.06650 


0.02532 


65 


66.3130 


6.6719 


1.72395 


1.61292 


69 


68.2185 


2.0079 


0.61074 


0.40387 


66 


65.3602 


2.5059 


0.40930 


4.94068 


68 


69.1713 


0.1739 


1.37185 


2.52257 


65 


67.2657 


6.6719 


5.13361 


0.10065 


71 


68.2185 


11.6759 


7.73672 


0.40387 


67 


67.7421 


0.3399 


0.55075 


0.02532 


68 


68.6949 


0.1739 


0.48286 


1.23628 


70 


69.6476 


5.8419 


0.12416 


4.26273 


Y = 67.5833 




Sum = 38.917 


Sum = 19.703 


Sum= 19.214 



The following output from MINITAB gives these same sums of squares. They are shown in bold. 
Note the tremendous amount of computation that the software saves the user. 

MTB > Regress 'Y' 1 'X' ; 
SUBO Constant; 
SUBO Brief 1. 

Regression Analysis 

The regression equation is 
Y = 35.8 + 0.476 X 

Analysis of Variance 

Source DF 

Regression 1 

Residual Error 10 

Total 11 



SS MS F P 

19.214 19.214 9.75 0.011 

19.703 1.970 
38.917 



COEFFICIENT OF CORRELATION 

14.9 Use the results of Problem 14.8 to find (a) the coefficient of determination and (b) the coefficient 
of correlation. 

SOLUTION 

. „. . . , . . explained variation 19.214 

(a) Coefficient of determination = r= — — — = — — — = 0.4937 

total variation 38.917 

(b) Coefficient of correlation = r = ±^0.4937 = ±0.7027 

Since X and Y are directly related, we choose the plus sign and have two decimal places r = 0.70. 

14.10 Prove that for linear regression the coefficient of correlation between the variables X and Fcan be 
written 

V(E * 2 )(E y 2 ) 

where x = X - X and y=Y-Y. 
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SOLUTION 

The least-squares regression line of Y on X can be written F cst = a + a x X c 
[see Problem 13.15(a)] 

„. - =L^Z and y est = F cst - Y 



E 

explained 



__ E(Y est -Y) 2 = EyL 

total variation J2(Y - Y) 2 E >' 2 



E 
~ E-' 2 



and 

However, since the quantity 
is positive when y t 



E>' 2 vE*V E>' 2 (E^XE/) 

V(E * 2 )(E /) 
E^ 



vXE^XE/) 



;s (i.e., positive linear correlation) and negative when j> est decreases 
e linear correlation), it automatically has the correct sign associated with it. Hence 
we define the coefficient of linear correlation to be 



y/(E* 2 XE 7) 

This is often called the product-moment formula for the linear correlation coefficient. 



PRODUCT-MOMENT FORMULA FOR THE LINEAR CORRELATION COEFFICIENT 

14.11 Find the coefficient of linear correlation between the variables X and Y presented in Table 14.7. 
Table 14.7 



The work involved in the computation can be organized a 

E xy 



~\/(E * 2 )(E 7) V(132)(56) 



Table 14.8. 
= 0.977 



This shows that there is a very high linear correlation between the variables, as we have already 
observed in Problems 13.8 and 13.12. 



X 


Y 


x = X-X 


y = Y — Y 








1 


1 


-6 


-4 


36 


24 


16 


3 


2 


-4 


-3 


16 


12 


9 


4 


4 


-3 


-1 


9 


3 


1 


6 


4 


-1 


-1 


1 


1 


1 


8 


5 


1 





1 








9 


7 






4 


4 


4 


11 


8 


4 


3 


16 


12 


9 


14 


9 


7 


4 


49 


28 


16 


EX = 56 
X = 56/8 = 7 


£ F = 40 
Y = 40/8 = 5 






E x 2 = 132 


E xy = 84 


E / = 56 
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14.12 In order to investigate the connection between grade point average (GPA) and hours of TV 
watched per week, the pairs of data in Table 14.9 and Fig. 14-4 were collected and EXCEL 
was used to make a scatter plot of the data. The data was collected on 10 high school students and 
X is the number of hours the subject spends watching TV per week (TV hours) and Y is 
their GPA: 

Table 14.9 

TV hours GPA 

20 2.35 

5 3.8 

8 3.5 

10 2.75 

13 3.25 

7 3.4 

13 2.9 
5 3.5 

25 2.25 

14 2.75 



4n 1 

3.8 - ♦ 
3.6- 

3.4- ♦ 
3.2- 

< 

Q- 3 - 

<3 ♦ 
2.8- 4 , 

2.6- 

2.4- ^ 

2.2 - * 

2 -I , , , , , 

5 10 15 20 25 30 

TV hours 

Fig. 14-4 EXCEL scatter plot of data in Problem 14.12. 

Use EXCEL to compute the correlation coefficient of the two variables and verify it by using the 
product-moment formula. 

SOLUTION 

The EXCEL function =CORREL (E2 :Ell,F2:Fll)is used to find the correlation coefficient, if the 
TV hours are in E 2 : E 1 1 and the GPA values are in F 2 : F 1 1. The correlation coefficient is -0.9097. The 
negative value means that the two variables are negatively correlated. That is, as more hours of TV are 
watched, the lower is the GPA. 



14.13 A study recorded the starting salary (in thousands), Y, and years of education, X, for 10 workers. 
The data and a SPSS scatter plot are given in Table 14.10 and Fig. 14-5. 
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Starting salary 


Years of Education 


35 


12 


46 


16 


48 


16 


50 


15 


40 


13 


65 


19 


28 


10 


37 


12 


49 


17 


55 


14 



is 50 00 



15.00 18.00 
Education 

tter plot for Problem 14.13. 



12.00 

Fig. 14-5 SPSS 

Use SPSS to compute the correlation coefficient of the 
product-moment formula. 

SOLUTION 



) variables and verify it by using the 



Correlations 





startsal 


education 


startsal Pearson Correlation 
Sig. (2-tailed) 
N 


1 

10 


.891** 
.001 
10 


education Pearson Correlation 
Sig. (2-tailed) 
N 


.891** 
.001 
10 


1 

10 



"Correlation is significant at the 0.001 level (2-tailed). 
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The SPSS pull-down Analyze — > Correlate — ► Bivariate gives the correlation by the product- 
formula. It is also called the Pearson correlation. 



The output above gives the coefficient of correlati 



14.14 A study recorded the hours per week on a cell phone, Y, and letters in the last name, 
X, for 10 students. The data and a STAT1STIX scatter plot are given in Table 14.11 and 
Fig. 14-6. 

Table 14.11 



n Cell Phont 



Letters in Last Name 



Fig. 14-6 STATISTIX scatter plot of data in Table 14.11. 

Use STATISTIX to compute the correlation coefficient of the two variables and verify it by using the 
product-moment formula. 



The pull-down ' 
following output: 

Statistix 8.0 
Correlations (Pearson) 



Statistics — > Linear models — > correlations (Pearson) " gives the 



P-VALUE 



-0.4701 
0.1704 



CORRELATION THEORY 



The coefficient of correlation is r= — 0.4701. There is no significant correlation between these two 
variables. 



14.15 Show that the linear correlation coefficient is given by 

n j: xy - (j: x)(j: y) 



\Me* 2 -(E E f2 -(E y) 2 ] 

SOLUTION 

Writing x = X - X and y = Y - F in the result of Problem 14.10, we have 
E*>' _ ll(X-X)(Y-Y) 

V(E-* 2 )(E/) ^E(jr-z) 2 ]E(T-?) 2 ] 
But £ (z - x)( r - f) = £ (^r - xf - xF + xF) = e xf - x e Y-YY.X + NXY 

= E XF-iVZF-AfFZ + iVZF = X; XF-iVXF 
since X = (E X)/JV and F = (£ ^)/^- Similarly, 

E - x) 2 = E (^ 2 - 2xx + x 2 ) = Y,x 2 ~2xj2x + nx 2 



(34) 



and J2(Y- ?)- = •£ Y' 

Thus equation (34) becomes 

E^f-(E *)(£ y)/n 



(E if 

iV 

^E^-(E *)(£ y) 



VlE ^ 2 - (E rf/N][j: y 2 - (£ y) 2 /n] ^E^ 2 -(E ^I^E^-IE if] 



14.16 The relationship between being overweight and level of high blood pressure was researched in 
obese adults. Table 14.12 gives the number of pounds overweight and the number of units over 80 
of the diastolic blood pressure. A SAS scatter plot is given in Fig. 14-7. 



Pounds over Weight 


Units over 80 


75 
86 


15 
13 


88 


10 


125 


27 


75 


20 


30 


5 


47 


8 


150 


31 


114 


28 


68 


22 
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40 



0- | 

30 40 50 60 70 80 90 100 110 120 130 140 150 
Overweight 

Fig. 14-7 SAS scatter plot for Problem 14.16. 



Use SAS to compute the correlation coefficient of the two variables and verify it by using the 
product-moment formula. 



The SAS pull-down Statistics — > Descriptive — > Correlations gives the correlation procedure, part of 
which is shown. 

The CORR Procedure 

Overwt Over 8 

Overwt 1.00000 0.85536 

Overwt 0.0016 
Over80 0.85536 1.00000 

Over80 0.0016 
The output gives the coefficient of correlation as 0.85536. There is a significant 
correlation between how much overweight the individual is and how far over 80 their diastolic blood 
pressure is. 



CORRELATION COEFFICIENT FOR GROUPED DATA 

14.17 Table 14.13 shows the frequency distributions of the final grades of 100 students in mathematics 
and physics. Referring to this table, determine: 

(a) The number of students who received grades of 70-79 in mathematics and 80-89 in physics. 

(b) The percentage of students with mathematics grades below 70. 

(c) The number of students who received a grade of 70 or more in physics and of less than 80 in 
mathematics. 

(d) The percentage of students who passed at least one of the subjects; assume that the minimum 
passing grade is 60. 
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SOLUTION 

(a) In Table 14.13, proceed down the column headed 70-79 (mathematics grade) to the row marked 80-89 
(physics grade), where the entry is 4, which is the required number of students. 



Table 14.13 







Mathematics Grades 






40-49 


50-59 


60-69 


70-79 


80-89 


90-99 


Total 




90-99 








2 


4 


4 


10 




80-89 






1 


4 


6 


5 


16 


cs Grades 


70-79 






5 


10 




1 


24 


60-69 


1 


4 


9 


5 






21 




50-59 


3 


6 


6 








17 




40^19 


3 


5 


4 








12 




Total 


7 


15 


25 


23 


20 


10 


100 



(/>) The (olal number of sludenls with malhemancs grades below 70 is the number with grades 46 — 1-6 + (lie 
number with grades 50-59 + the number with grades 60-69 = 7+15 + 25 = 47. Thus the required 
percentage of students is 47/100 = 47%. 

(c) The required number of students is the total of the entries in Table 14.14 (which represents part of 
Table 14.13). Thus the required number of students is 1 + 5 + 2 + 4 + 10 = 22. 

(d) Table 14.15 (taken from Table 14.13) shows that the number of students with grades below 60 in both 
mathematics and physics is 3 + 3 + 6 + 5 = 17. Thus the number of students with grades 60 or over in 
either physics or mathematics or in both is 100 — 17 = 83, and the required percentage is 
83/100 = 83%. 





Mathematics 
Grades 




60-69 


70-79 




90-99 




2 


Physics 
Grades 


80-89 


1 


4 




70-79 


5 


10 





Mathematics 
Grades 


40-49 


50-59 


Physics 
Grades 


50-59 


3 


6 


40-49 


3 


5 



Table 14.13 is sometimes called a bivariate frequency table, or bivariate frequency distribution. 
Each square in the table is called a cell and corresponds to a pair of classes or class intervals. The 
number indicated in the cell is called the cell frequency. For example, in part (a) the number 4 is 
the frequency of the cell corresponding to the pair of class intervals 70-79 in mathematics and 80-89 in 
physics. 

The totals indicated in the last row and last column are called marginal totals, or marginal frequencies. 
They correspond, respectively, to the class frequencies of the separate frequency distributions of the mathe- 
matics and physics grades. 



CHAP. 14] 



CORRELATION THEORY 



367 



14.18 Show how to modify the formula of Problem 14.15 for the case of data grouped as in the bivariate 
frequency table (Table 14.13). 

SOLUTION 

For grouped data, we can consider the various values of the variables X and Y as coinciding with the 
class marks, while f x and f Y are the corresponding class frequencies, or marginal frequencies, shown in the 
last row and column of the bivariate frequency table. If we let / represent the various cell frequencies 
corresponding to the pairs of class marks (X,Y), then we can replace the formula of Problem 14.15 with 

NJ2fXY-(J2f x X)(j:f Y Y) 

y/[N E fx* 2 - (E hxf] [N E fy Y 2 - (E fy Y) 2 } 

If we let X = A + c x u x and Y = B + c Y u Y , where c x and c Y are the class-interval widths (assumed 
constant) and A and B are arbitrary class marks corresponding to the variables, formula (35) becomes 
formula (21) of this chapter: 

r = N'£fi tx u r -C£f x u x )C£fru Y ) 

yJWZfxti ~ (E fxU x f][N E fyu\ - (E fyuyf] 

This is the coding method used in previous chapters as a short method for computing means, standard 
deviations, and higher moments. 



14.19 Find the coefficient of linear correlation of the mathematics and physics grades of Problem 14.17. 
SOLUTION 

We use formula (21). The work can be arranged as in Table 14.16, which is called a correlation table. 
The sums E fx, E fx u x, E fx u x, E /r> E fy u y > an d E fy u Y are obtained by using the coding method, 
as in earlier chapters. 

The number in the corner of each cell in Table 14.16 represents the product fii x ii Y , where /is the cell 
frequency. The sum of these corner numbers in each row is indicated in the corresponding row of the lasi 
column. The sum of these corner numbers in each column is indicated in the corresponding column of the 
last row. The final totals of the last row and last column are equal and represent J2f u xUy- 

From Table 14.16 we have 

Nj2fu X U Y -(j:f X U X )(j:f Y U Y ) 

\/[^E/i-4 - (Efxu x ) 2 ][N Y.fyu 2 Y - (E/r"r) 2 ] 
= (100)(125)-(64)(-55) = 16,020 = Q ^ 

a/[(100(236) - (64) 2 ][(100)(253) - (-55) 2 ] \/(19,504)(22,275) 



14.20 Use Table 14.16 to compute (a) s x , (b) s Y , and (c) s X y and thus to verify the formula 

r = s XY /{s x s Y ). 




= 160.20 



Thus the standard deviations of the mathematics and physics grades are 14.0 and 14.9, respectively, while 
their covariance is 160.2. The correlation coefficient r is therefore 
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Table 14.16 







Mathematics Grades X 








X 


44.5 


54.5 


64.5 


74.5 


84.5 


94.5 








Sum of 




Y 


\ u x 

Uy \ 














fr 


fyUy 


fyu] 


numbers 

Tw ch 




94.5 


2 








2 

RT 


4 

[16 


4 


10 


20 


40 


44 




84.5 


1 






1 

RT 


4 

[4 


6 

RY 


5 


16 


16 


16 


31 




74.5 








RT 


(0 


ro 


R> 












64.5 


-1 


1 


4 


9 

RT 


5 

R 5 


rr 
4 




21 


-21 


21 


-3 




54.5 




3 

w 


6 


6 

[0 


2 

F 






17 


-34 


68 


20 




44.5 


-3 


3 


5 

f 5 


4 

R> 








12 


-36 


108 


33 




/ 




7 


15 


25 


23 


20 


10 


E/x = E /r 
= TV = 100 


E fvuy 
= -55 


E fyUy 
= 253 


E f u xUy 
= 125 




fxux 


-14 


-15 





23 


40 




= 64 










fxul 


28 


15 





23 


80 


90 


Efxux 
= 236 


✓ 






Sum of corner 
numbers in 
each column 


32 


31 





-1 


24 


39 


E /«x"y 
= 125 









s x s Y (13.966)(14.925) 

agreeing with Problem 14.19. 



REGRESSION LINES AND THE CORRELATION COEFFICIENT 

14.21 Prove that the regression lines of Y on X and of Ion Y have equations given, respectively, by 
(a) Y — Y = (rs Y /s x )(X - Y) and (b) X- X = (rs x /s r )(Y - Y). 

SOLUTION 

(a) From Problem 13.15(a), the regression line of Y on X has the equation 

-(If)' » '-'-(&>-» 

Then, since r - '■ _ (see Problem 14.10) 

V(E* 2 )(E/) 



CORRELATION THEORY 



and the required result follows. 
(b) This follows by interchanging X and Y in part (a). 

14.22 If, the regression lines of Y on X and of X on Y are given, respectively, by Y = a + a x X and 
X = b + b x Y, prove that a x b x = r 2 . 

SOLUTION 

From Problem 14.21, parts (a) and (h). 

a x = — and b x = — 

This result can be taken as the starting point for a definition of the linear correlation coefficient. 

14.23 Use the result of Problem 14.22 to find the linear correlation coefficient for the data of 
Problem 14.1. 



From Problem 14.1 [parts (b) and (c), respectively] a x = 484/1016 = 0.476 and b x = 484/467 = 1.036. 
Thus Y 2 = a x b x = (384/1016)(484/467) and r = 0.7027. 

14.24 For the data of Problem 14.19, write the equations of the regression lines of (a) Y on X and 
(b) X on Y 

SOLUTION 

From the correlation table (Table 14.16) of Problem 14.19 we have 

X^ + Cx ^ = 64.5 + M = 70.9 

f^ +Cy ^^74.5 + ( 10 ^ 55 ^ 6 9 .0 
N 100 

From the results of Problem 14.20, s x = 13.966, s Y = 14.925, and r = 0.7686. We now use Problem 14.21, 
parts (a) and (b), to obtain the equations of the regression lines. 

(a) Y — Y = — ^ (X — X) Y - 69.0 = 

(b) X-X = r -^(Y-Y) X- 70.9 = (?^M (y . 69 . 0) = 0.719(7 -69.0) 

14.25 For the data of Problem 14.19, compute the standard errors of estimate (a) s Y .x an d (b) s x .y- 
Use the results of Problem 14.20. 



(a) s Y . x = s Y Vl -r 2 = 14.925^ 1 - (0.7686) 2 = 9.5- 

(b) s x . Y = SxVl-r 2 = 13.966^1 - (0.7686) 2 = 8.9: 
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14.26 Table 14.17 shows the United States consumer price indexes for food and medical-care costs 
during the years 2000 through 2006 compared with prices in the base years, 1982 to 1984 
(mean taken as 100). Compute the correlation coefficient between the two indexes and give the 
MINITAB computation of the coefficient. 



272.8 285.6 



Source: Bureau of Labor Statistics. 



SOLUTION 

Denoting the index numbers for food and medical care as A!" and ) . respectively, the calculation of the 
correlation coefficient can be organized as in Table 14.18. (Note that the year is used only to specify the 
corresponding values of X and 7.) 



190.7 
195.2 



260.8 
272.8 
285.6 



336.2 
7 = 298.0 



182.25 502.20 

67.24 206.64 

26.01 63.24 
1.69 1.17 

24.01 59.29 

88.36 236.88 

193.21 530.98 
Sum =582.77 Sum = 1600.4 



Then by the product-moment formula 

r= ^ = =0-998 
\/(£* 2 )(£ 7) \/(582.77)(4414.14) 

After putting the X values in CI and the Y values in C2, the MINITAB command correlation ci C2 
produces the correlation coefficient which is the same that we computed. 
Correlations: X, Y 

Pearson correlation of X and Y = .998 
P-Value= . 000 



NONLINEAR CORRELATION 



14.27 Fit a least-squares parabola of the form 7= a + a { X '+ a 2 X 2 to the set of data in Table 14.19. 
Also give the MINITAB solution. 
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SOLUTION 

The normal equations (23) of Chapter 13 are 

E Y =" N + ai J2X +a 2 J2X 2 

J2XY =a J2X + a] J2X 2 + a 2 j:X 3 (36) 
J2X 2 Y = a J2X 2 + a l j:X 3 + a 2 j: X 4 

Table 14.19 

X 1.2 1.8 3.1 4.9 5.7 7.1 8.6 9.8 
Y 4.5 5.9 7.0 7.8 7.2 6.8 4.5 2.7 



The work involved in computing the sums can be arranged as in Table 14.20. Then, since N = 8, the normal 
equations (36) become 

8a„ + 42.2a, + 291.20a 2 = 46.4 

42.2a + 291.20a, + 2275.35a 2 = 230.42 (37) 

291.20a + 2275.35a, + 18971. 92^ = 1449.00 

Solving, a = 2.588, a, = 2.065, and a 2 = -0-2110; hence the required least-squares parabola has the 
equation 

Y = 2.588 + 2.065Z - 0.21 10X 2 



Table 14.20 



X 


Y 


X 2 


X 3 


X 4 


XY 


X 2 Y 


1.2 


4.5 


1.44 


1.73 


2.08 


5.40 


6.48 


1.8 


5.9 


3.24 


5.83 


10.49 


10.62 


19.12 


3.1 


7.0 


9.61 


29.79 


92.35 


21.70 


67.27 


4.9 




24.01 


117.65 


576.48 


38.22 


187.28 


5.7 


7.2 


32.49 


185.19 


1055.58 


41.04 


233.93 


7.1 


6.8 


50.41 


357.91 


2541.16 


48.28 


342.79 


8.6 


4.5 


73.96 


636.06 


5470.12 


38.70 


332.82 


9.8 


2.7 


96.04 


941.19 


9223.66 


26.46 


259.31 


E^ 


E y 


E x 2 


E^ 3 


EX 4 


E xy 


E^ 2 F 


= 42.2 


= 46.4 


= 291.20 


= 2275.35 


= 18,971.92 


= 230.42 


= 1449.00 



The values for Fare entered into CI, the Values for X are entered into C2, and the values for X 2 are entered 
into C3. The MIN1TAB pull-down Stat — > Regression — > Regression is given. The least-squares 
parabola is given as part of the output as follows. 
The regression equation isY = 2. 59 + 2. 06X-0. 211 Xsquared 
This is the same solution as given by solving the normal equations. 

14.28 Use the least-squares parabola of Problem 14.27 to estimate the values of Y from the given 
values of X. 
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For X=1.2, F cst = 2.588 + 2.065(1.2) - 0.21 10(1.2) 2 = 4.762. Other estimated values are obtained 
similarly. The results are shown in Table 14.21 together with the actual values of F. 



4.762 5.621 6.962 7.640 7.503 6.613 4.741 2.561 



14.29 (a) Find the linear correlation coefficient between the variables X and F of 
Problem 14.27. 

(b) Find the nonlinear correlation coefficient between these variables, assuming the parabolic 
relationship obtained in Problem 14.27. 

(c) Explain the difference between the correlation coefficients obtained in parts (a) and (b). 
{d) What percentage of the total variation remains unexplained by assuming a parabolic 

relationship between X and F? 

SOLUTION 

(a) Using the calculations already obtained in Table 14.20 and the added fact that £F 2 = 290.52, we 
find that 

nExy-q: f) 
\/[NJ2 xl - (E x) 2 ][nj: f 2 - (E f) 2 ] 

= (8)(230.42) - (42.2)(46.4) = _ Q ^ 

Y / [(8)(291.20) - (42.2) 2 ][(8)(290.52) - (46.4) 2 ] 

(b) From Table 14.20, Y = (£ Y)/N = 46.4/8 = 5.80; thus the total variation is £ (F - F) 2 = 21.40. 
From Table 14.21, the explained variation is £ (F est - F) 2 = 21.02. Thus 



:xplained variation 21.02 



and r = 0.9911 or 0.99 



(c) The fact that part (a) shows a linear correlation coefficient of only —0.3743 indicates that there is 
practically no linear relationship between X and F. However, there is a very good nonlinear relationship 
supplied by the parabola of Problem 14.27, as indicated by the fact that the correlation coefficient in 
part (b) is 0.99. 

{d) Unexplained variation = { ? = { q.9822 = 0.0178 

Total variation 

Thus 1.78% of the total variation remains unexplained. This could be due to random fluctuations or to 
an additional variable that has not been considered. 



14.30 Find (a) s Y and (b) s Y .x for the data of Problem 14.27. 
SOLUTION 

(a) From Problem 14.29(a), ]T ( F - F) 2 = 21.40. Thus the standard deviation of F is 

-= 1.636 or 1.64 



CHAP. 14] 



CORRELATION THEORY 



373 



(b) First method 

Using part (a) and Problem 14.29(6), the standard error of estimate of Y on X is 
s YX = Sy V \-r 2 = 1.636a/ 1 - (0.9911) 2 = 0.218 or 0.22 

Second method 

Using Problem 14.29, 

_ fy fe (Y - Test) 2 _ ^elplained variation _ ^ 21.40 - 21.02 _ Q ^ ^ Q2 
Third method 

Using Problem 14.27 and the additional calculation J2 Y 2 = 290.52, we have 
s Y . x = JZY>-a i:Y- fl.E XY - a 2 E X 2 Y = Q ^ ^ ^ 



SAMPLING THEORY OF CORRELATION 

14.31 A correlation coefficient based on a sample of size 18 was computed to be 0.32. Can we conclude 
at significance levels of (a) 0.05 and (b) 0.01 that the corresponding population correlation 
coefficient differs from zero? 

SOLUTION 

We wish to decide between the hypotheses H : p = and H x : p > 0. 

^/N^2 0.32V 7 



yjl - (0.32) 2 

(a) Using a one-tailed test of Student's distribution at the 0.05 level, we would reject H if t > 1 95 = 1.75 
for (18 — 2) = 16 degrees of freedom. Thus we cannot reject H at the 0.05 level. 

(b) Since we cannot reject H at the 0.05 level, we certainly cannot reject it at the 0.01 level. 

14.32 What is the minimum sample size necessary in order that we may conclude that a correlation 
coefficient of 0.32 differs significantly from zero at the 0.05 level? 

SOLUTION 

Using a one-tailed test of Student's distribution at the 0.05 level, the minimum value of N must be such 



V 1 ~ (°- 32 ) 2 

for N - 2 degrees of freedom. For an infinite number of degrees of freedom, t 95 = 1.64 and hence N = 25.6. 
For TV = 26: ^ = 24 f 95 = 1.71 t = 0.32%/24/a/ 1 - (0.32) 2 = 1.65 
For TV = 27: ^ = 25 f 95 = 1.71 t = 0.32\/25/ yj 1 - (0.32) 2 = 1.69 
For = 28: ^ = 26 f 95 = 1.71 t = 0.32\/26/ 1 \J 1 - (0.32) 2 = 1.72 

Thus the minimum sample size is N = 28. 
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A correlation coefficient on a sample of size 24 was computed to be r — 0.75. At the 0.05 
significance level, can we reject the hypothesis that the population correlation coefficient is as 
small as (a) p = 0.60 and (b) p = 0.50? 

SOLUTION 

(a) Z= 1.1513 




Using a one-tailed test of the normal distribution at the 0.05 level, we would reject the hypothesis only if 
z were greater than 1.64. Thus we cannot reject the hypothesis that the population correlation coefficient 
is as small as 0.60. 

(b) If p = 0.50, then p z = 1.15131og3 = 0.5493 and z = (0.9730 - 0.5493)/0.2182 = 1.94. Thus we 
can reject the hypothesis that the population correlation coefficient is as small as p = 0.50 at the 
0.05 level. 



14.34 The correlation coefficient between the final grades in physics and mathematics for a group of 
21 students was computed to be 0.80. Find the 95% confidence limits for this coefficient. 

SOLUTION 

Since r = 0.80 and N = 21, the 95% confidence limits for p z are given by 
Z±1.96cr z = 1.15131og ^ 
Thus p z has the 95% confidence interval 0.5366 to 1.5606. Now if 
p z = 1. 1513 log = 



= 0.5366 then 



and if nz= 1. 1513 log (j-J- = 1.5606 then p = 0.9155 

Thus the 95% confidence limits for p are 0.49 and 0.92. 



14.35 Two correlation coefficients obtained from samples of size N { = 28 and 7Y 3 = 35 were computed 
to be r x = 0.50 and r 2 = 0.30, respectively. Is there a significant difference between the two 
coefficients at the 0.05 level? 



Zj = 1. 1513 log (jzryj = - 5493 z 2 = l15131 °s(y^7) = - 3095 



^ = V^ + ^3 = °- 2669 
We wish to decide between the hypotheses H : p Z[ = /i z , and H x : p Z[ ^ p Zr Under hypothesis H , 

Z l -Z 2 -(p Zi -p Zi ) 0.5493 - 0.3095 -0 
2 = ^ = 0^669 = °- 8985 



SOLUTION 



and 



Using a two-tailed test of the normal distribution, we would reject H only if z > 1.96 or z < -1.96. Thus we 
cannot reject H , and we conclude that the results are not significantly different at the 0.05 level. 
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14.36 In Problem 14. 1 we found the regression equation of Y on X to be Y = 35.82 + 0.476X. Test the 
null hypothesis at the 0.05 significance level that the regression coefficient of the population 
regression equation is 0.180 versus the alternative hypothesis that the regression coefficient 
exceeds 0.180. Perform the test without the aid of computer software as well as with the aid of 
MINITAB computer software. 



SOLUTION 

ai-A r—— 0.476- 0.180 r- — - ,„ 
t = S^d^ 2 = 1-28/2.66 
since S Fjr =1.28 (computed in Problem 14.5) and S x = \/(E x 2 )/N= ^84.68/12 = 2.66. Using a 
one-tailed test of Student's distribution at the 0.05 level, we would reject the hypothesis that the 
regression coefficient is 0.180 if t> t 95 = 1.81 for (12-2) = 10 degrees of freedom. Thus, we reject 
the null hypothesis. 

The MINITAB output for this problem is as follows. 
MTB > Regress ' Y' 1 ' X ' ; 
SUBO Constant; 
SUBO Predict c7 . 



Regression Analysis 

The regression equation is 
Y = 35.8 + 0.476 X 

Predictor Coef StDev T P 

Constant 35.82 10.18 3.52 0.006 

X 0.4764 0.1525 3.12 0.011 

S = 1.404 R-Sq = 49.4% R-Sq ( ad j ) = 44 . 3% 

Analysis of Variance 

Source DF SS MS F P 

Regression 1 19.214 19.214 9.75 0.011 

Residual Error 10 19.703 1.970 

Total 11 38.917 

Predicted Values 

Fit StDev Fit 95.0% CI 95.0% PI 

66.789 0.478 (65.724, 67.855) (63.485, 70.094) 

69.171 0.650 (67.723, 70.620) (65.724, 72.618) 

The following portion of the output gives the information needed to perform the test of hypothesis. 
Predictor Coef StDev T P 

Constant 35.82 10.18 3.52 0.006 

X 0.4764 0.1525 3.12 0.011 



The computed test statistic is found as follows: 

0.4764- 0.180 

t = = 1.94 

0.1525 

The computed t value shown in the output, 3 . 12, is used for testing the null hypothesis that the regression 
coefficient is 0. To test any other value for the regression coefficient requires a computation like the 
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one shown. To test that the regression coefficient is 0.25, for example, the computed value of the tesl statistic 
would equal 

0.4764- 0.25 , AO 
'= 0.1525 =L48 
The null hypothesis that the regression coefficient equals 0.25 would not be rejected. 

14.37 Find the 95% confidence limits for the regression coefficient of Problem 14.36. Set the confidence 
interval without the aid of any computer software as well as with the aid of MINITAB computer 
software. 

SOLUTION 

The confidence interval may be expressed as 



Thus the 95% confidence limits for A\ (obtained by setting t 
freedom) are given by 



= ±2.23 for 12 - 



.476^(^=0.4 

VTo V 2 - 66 / 

'a confident that A x lies between 0.136 and 0.816. 
The following portion of the MINITAB output from Problem 14.36 g 
the 95% confidence interval. 



Coef 
35.82 
0.4764 



StDev 
10. 18 
0.1525 



3.52 
3.12 



0.006 
0.011 



is the information needed tc 



is sometimes called the standard error associated with the estimated regression coefficient. The value for this 
standard error is shown in the output as 0.1525. To find the 95% confidence interval, we multiply this 
standard error by t 915 and then add and subtract this term from a, = 0.476 to obtain the following 
confidence interval for A i : 

0.476 ± 2.23(0.1525) = 0.476 ± 0.340 



14.38 In Problem 14.1, find the 95% confidence limits for the heights of sons whose fathers' heights are 
(a) 65.0 and (b) 70.0 in. Set the confidence interval without the aid of any computer software as 
well as with the aid of MINITAB computer software. 

SOLUTION 

Since t g75 = 2.23 for (12 - 2) = 10 degrees of freedom, the 95% confidence limits for Y P are given by 
where F = 35.82 + 0.476X , S Y .x = 1-28, S x = 2.66, and N = 12. 

(a) If X = 65.0, then Y = 66.76 in. Also, {X - X) 2 = (65.0 - 66.67) 2 = 2.78. Thus the 95% confidence 

2 23 / 2 78 

66.76 ± -== (1.28) J 12 + 1 + — T = 66.76 ± 3.30 in 
VTO V 2.66 2 

That is, we can be 95% confident that the sons' heights are between 63.46 and 70.06 in. 
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(b) If X = 70.0, then Y = 69.14 in. Also, (X - X) 2 = (70.0 - 66.67) 2 = 1 1.09. Thus the 95% confidence 
limits are computed to be 69.14 ± 3.45 in; that is, we can be 95% confident that the sons' heights are 
between 65.69 and 72.59 in. 

The following portion of the MIN1TAB output found in Problem 14.36 gives the confidence limits 
for the sons' heights. 

Predicted Values 

Fit StDevFit 95.0% CI 95.0% PI 

66.789 0.478 (65.724, 67.855) (63.485, 70.094) 

69.171 0.650 (67.723, 70.620) (65.724, 72.618) 

The confidence interval for individuals are sometimes referred to as prediction intervals. The 95% 
prediction intervals are shown in bold. These intervals agree with those computed above except for rounding 
errors. 



14.39 In Problem 14.1, find the 95% confidence limits for the mean heights of sons whose fathers' 
heights are (a) 65.0 in and (b) 70.0 in. Set the confidence interval without the aid of any computer 
software as well as with the aid of MINITAB computer software. 



where F„ = 35.82 + 0.476Z , S Y , X = 1.28, and S x = 2.66. 

(a) For X = 65.0, we find the confidence limits to be 66.76 ± 1.07 or 65.7 and 67.8. 

(b) For X = 70.0, we find the confidence limits to be 69.14 ± 1.45 or 67.7 and 70.6. 

The following portion of the MINITAB output found in Problem 14.36 gives the confidence limits for the 
mean heights. 

Predicted Values 

Fit StDevFit 95.0% CI 95.0% PI 

66.789 0.478 (65.724, 67.855) (63.485, 70.094) 

69.171 0.650 (67.723, 70.620) (65.724, 72.618) 



LINEAR REGRESSION AND CORRELATION 

14.40 Table 14.22 shows the first two grades (denoted by X and Y, respectively) of 10 students on two short quizzes 
in biology. 

(a) Construct a scatter diagram. 

(b) Find the least-squares regression line of Y on X. 

(c) Find the least-squares regression line of X on Y. 

(d) Graph the two regression lines of parts (b) and (c) on the scatter diagram of part (a). 



SOLUTION 



Since f 975 = 2.23 for 10 degrees of freedom, the 95% confidence limits for F P 



given by 




Supplementary Problems 



14.41 



Find (a) s Y .x and (b) s x , Y for the data in Table 14.22. 
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Table 14.22 



Grade on first quiz (X) 


6 5 


8 8 7 


6 10 4 9 7 


Grade on second quiz (Y) 


8 7 


7 10 5 


8 10 6 8 6 



14.42 Compute (a) the total variation in Y, (b) the unexplained variation in Y, and (c) the explained variation in 
Y for the data of Problem 14.40. 



14.43 Use the results of Problem 14.42 to find the correlation coefficient between the two sets of quiz grades of 
Problem 14.40. 

14.44 Find the correlation coefficient between the two sets of quiz grades in Problem 14.40 by using the product- 
moment formula, and compare this finding with the correlation coefficient given by SPSS, SAS, 
STATISTIX, MINITAB, and EXCEL. 

14.45 Find the covariance for the data of Problem 14.40(a) directly and (b) by using the formula s XY = rs x s Y and 
the result of Problem 14.43 or Problem 14.44. 

14.46 Table 14.23 shows the ages X and the systolic blood pressures Y of 12 women. 

(a) Find the correlation coefficient between X and Y using the product-moment formula, EXCEL, 
MINITAB, SAS, SPSS, and STATISTIX. 

(b) Determine the least-squares regression equation of Y on X by solving the normal equations, and by 
using EXCEL, MINITAB, SAS, SPSS, and STATISTIX. 

(c) Estimate the blood pressure of a woman whose age is 45 years. 



Table 14.23 



Age (X) 


56 42 72 36 63 47 55 49 38 42 68 60 


Blood pressure (Y) 


147 125 160 118 149 128 150 145 115 140 152 155 



14.47 Find the correlation coefficients for the data of (a) Problem 13.32 and (b) Problem 13.35. 

14.48 The correlation coefficient between two variables X and Y is r = 0.60. If s x = 1.50, s Y = 2.00, X = 10, and 
Y = 20, find the equations of the regression lines of (a) Y on X and (b) X on Y. 

14.49 Compute (a) s YX and (b) s X Y for the data of Problem 14.48. 

14.50 If s YX = 3 and s Y = 5, find r. 

14.51 If the correlation coefficient between X and Y is 0.50, whal percentage of the total variation remains 
unexplained by the regression equation? 

14.52 (a) Prove that the equation of the regression line of Y on X can be written 

Y - Y = S -jf (X - X) 
(b) Write the analogous equation for the regression line of X on Y. 
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14.53 (a) Compute the correlation coefficient between the corresponding values of X and Y given in Table 14.24. 
Table 14.24 



(b) Multiply each X value in the table by 2 and add 6. Multiply each Y value in the table by 3 and subtract 
15. Find the correlation coefficient between the two new sets of values, explaining why you do or do not 
obtain the same result as in part (a). 



14.54 (a) 

(b) 



Find the regression equations of yon X for the data considered ii 
Discuss the relationship between these regression equations. 



Problem 14.53, parts (a) and (b). 



14.55 (a) Prove that the correlation coefficient between X and Y can be written 



\J[X 2 — X 2 ][Y 2 — Y 2 ] 



(b) Using this method, work Problem 14.1. 

14.56 Prove that a correlation coefficient is independent of the choice of origin of the variables or the units in 
which they are expressed. (Hint: Assume that X' = c x X + A and Y' = c 2 Y + B, where a, c 2 , A, and B are 
any constants, and prove that the correlation coefficient between X' and Y 1 is the same as that between 
Xand Y.) 

14.57 (a) Prove that, for linear regression, 

s\.X _ S 2 X.Y 

4 4 

(b) Does the result hold for nonlinear regression? 



CORRELATION COEFFICIENT FOR GROUPED DATA 



14.58 Find the correlation coefficient between the heights and weights of the 300 U.S. adult males given ii 
Table 14.25, a frequency table. 







Heights X (in) 






59-62 


63-66 


67-70 


71-74 


75-78 




90-109 


2 


1 










110-129 


7 


8 


4 


2 






130-149 


5 


15 


22 


7 


1 




150-169 


2 


12 


63 


19 


5 




170-189 




7 


28 


32 


12 


190-209 




2 


10 


20 


7 




210-229 






1 


4 


2 
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14.59 (a) Find the least-squares regression equation of Y on X for the data of Problem 14.58. 
(b) Estimate the weights of two men whose heights are 64 and 72 in, respectively. 

14.60 Find (a) s Y , x and (b) s x . Y for the data of Problem 14.58. 

14.61 Establish formula (21) of this chapter for the correlation coefficient of grouped data. 
CORRELATION OF TIME SERIES 

14.62 Table 14.26 shows the average annual expenditures per consumer unit for health care and the per capita 
income for the years 1999 through 2004. Find the correlation coefficient. 

Table 14.26 



Year 


1999 


2000 


2001 


2002 


2003 


2004 


Health care cost 


1959 


2066 


2182 


2350 


2416 


2574 


Per capita income 


27939 


29845 


30574 


30810 


31484 


33050 



Source: Bureau of Labor Statistics and U.S. Bureau of Economic Analysis. 



14.63 Table 14.27 shows the average temperature and precipitation in a city for the month of July during the years 
2000 through 2006. Find the correlation coefficient. 

Table 14.27 



Year 


2000 


2001 


2002 


2003 


2004 


2005 


2006 


Temperature (°F) 


78.1 


71.8 


75.6 


72.7 


75.3 


73.6 


75.1 


Precipilalion (in) 


6.23 


3.64 


3.42 


2.84 


1.83 


2.82 


4.04 



SAMPLING THEORY OF CORRELATION 

14.64 A correlation coefficient based on a sample of size 27 was computed to be 0.40. Can we conclude at 
significance levels of (a) 0.05 and (b) 0.01, that the corresponding population correlation coefficient differs 
from zero? 

14.65 A correlation coefficient based on a sample of size 35 was computed to be 0.50. At the 0.05 significance level, 
can we reject the hypothesis that the population correlation coefficient is (a) as small as p = 0.30 and (b) as 
large as p = 0.70? 

14.66 Find the (a) 95% and (b) 99% confidence limits for a correlation coefficient that is computed to be 0.60 from 
a sample of size 28. 

14.67 Work Problem 14.66 if the sample size is 52. 

14.68 Find the 95% confidence limits for the correlation coefficients computed in (a) Problem 14.46 and 
(b) Problem 14.58. 
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14.69 Two correlation coefficients obtained from samples of size 23 and 28 were computed to be 0.80 and 0.95, 
respectively. Can we conclude at levels of (a) 0.05 and (b) 0.01 that there is a significant difference between 
the two coefficients? 

SAMPLING THEORY OF REGRESSION 

14.70 On the basis of a sample of size 27, a regression equation of Y on X was found to be Y = 25.0 + 2.QQX. If 
s Y .x = 1-50, s x = 3.00, and X = 7.50, find the (a) 95% and (b) 99% confidence limits for the regression 
coefficient. 

14.71 In Problem 14.70, test the hypothesis that the population regression coefficient at the 0.01 significance level is 
(a) as low as 1.70 and (b) as high as 2.20. 

14.72 In Problem 14.70, find the (a) 95% and (b) 99% confidence limits for Y when X = 6.00. 

14.73 In Problem 14.70, find the (a) 95% and (b) 99% confidence limits for the mean of all values of Y 
corresponding to X = 6.00. 

14.74 Referring to Problem 14.46, find the 95% confidence limits for (a) the regression coefficient of Y on X. (/>) the 
blood pressures of all women who are 45 years old, and (c) the mean of the blood pressures of all women 
who are 45 years old. 



Multiple and Partial 
Correlation 



MULTIPLE CORRELATION 

The degree of relationship existing between three or more variables is called multiple correlation. The 
fundamental principles involved in problems of multiple correlation are analogous to those of simple 
correlation, as treated in Chapter 14. 



SUBSCRIPT NOTATION 

To allow for generalizations to large numbers of variables, it is convenient to adopt a notation 
involving subscripts. 

We shall let X U X 2 ,X 3 , ... denote the variables under consideration. Then we can let 
X n ,X 12 ,X l3 , . . . denote the values assumed by the variable X x , and X 2l , X 22 , X 23 , . . . denote the values 

assumed by the variable X 2 , and so on. With this notation, a sum such as X 2l + X 22 + X 23 H h X 2N 

could be written Yl!j=\ x ip H,j X 2j, or simply Y^ x i- When no ambiguity can result, we use the last 
notation. In such case the mean of X 2 is written X 2 — ]T X 2 /N. 



REGRESSION EQUATIONS AND REGRESSION PLANES 

A regression equation is an equation for estimating a dependent variable, say X { , from the indepen- 
dent variables X 2 ,X 3 , . . . and is called a regression equation of X l on X 2 ,X 3 , In functional notation 

this is sometimes written briefly as X l = F(X 2 ,X 3 , . . .) (read "1^ is a function of X 2 ,X 3 , and so on"). 

For the case of three variables, the simplest regression equation of X { on X 2 and X 3 has the form 

X 1 =b L23 +b 123 X 2 + b l3 . 2 X 3 (7) 

where 61.23, ^12.3^ an d 613.2 are constants. If we keep X 3 constant in equation (7), the graph of X x versus X 2 is 
a straight line with slope b X2 3 . If we keep X 2 constant, the graph of X x versus X 3 is a straight line with 
slope b l3 2 . It is clear that the subscripts after the dot indicate the variables held constant in each case. 

Due to the fact that X x varies partially because of variation in X 2 and partially because of variation 
in X 3 , we call 612.3 an d 613.2 the partial regression coefficients of X x on X 2 keeping X 3 constant and of X x 
on X 3 keeping X 2 constant, respectively. 

382 
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Equation (7) is called a linear regression equation of X x on X 2 and I 3 . In a three-dimensional 
rectangular coordinate system it represents a plane called a regression plane and is a generalization of 
the regression line for two variables, as considered in Chapter 13. 



NORMAL EQUATIONS FOR THE LEAST-SQUARES REGRESSION PLANE 

Just as there exist least-squares regression lines approximating a set of N data points (X, Y) in a two- 
dimensional scatter diagram, so also there exist least-squares regression planes fitting a set of N data 
points (X l ,X 2 ,X 3 ) in a three-dimensional scatter diagram. 

The least-squares regression plane of X l on X 2 and X 3 has the equation (7) where 61.23, 612.3, an d 613.2 
are determined by solving simultaneously the normal equations 

£ x \ =bi.23N +6 12 . 3 £X 2 + 6 13 . 2 £^ 3 

E^2 = 6i.2 3 E^2 + 6i 2 .3E^2 2 + 6 13 .2 £ X 2 X 3 (2) 

£ ^z 3 = 61.23 £ ^3 + 612.3 E x 2 x 3 + 613.2 E A 

These can be obtained formally by multiplying both sides of equation (7) by 1, X 2 , and X 3 successively 
and summing on both sides. 

Unless otherwise specified, whenever we refer to a regression equation it will be assumed that the 
least-squares regression equation is meant. 

If x x = Xi - X x , x 2 = X 2 - X 2 , and x 3 = X 3 - X 3 , the regression equation of X x on X 2 and X 3 can 
be written more simply as 

x i = 612.3*2 + 613.2*3 (3) 
where 612.3 an d b l32 are obtained by solving simultaneously the equations 

E *1*2 = 612.3 £*2 + 613.2 £ *2*3 
£*1*3 = 612.3 £*2*3 + 613.2 £*3 

These equations which are equivalent to the normal equations (2) can be obtained formally by 
multiplying both sides of equation (5) by x 2 and x 3 successively and summing on both sides 
(see Problem 15.8). 



REGRESSION PLANES AND CORRELATION COEFFICIENTS 

If the linear correlation coefficients between variables X x and X 2 , X x and X 3 , and X 2 and X 3 , as 
computed in Chapter 14, are denoted respectively by r 12 , r l3 , and r 23 (sometimes called zero-order 
correlation coefficients), then the least-squares regression plane has the equation 

X\_ _ ( r \2 - >' 13 r 23 A 5 / r 13 - ^12^23 ^ *3 / j-n 

*r\ 1-^3 )*2 + { \~r\ 3 )s 3 & 

where x x — X - X x , x 2 = X 2 - X 2 , and x 3 = X 3 - X 3 and where s x , s 2 , and s 3 are the standard deviations 
of X x ,X 2 , and X 3 , respectively (see Problem 15.9). 

Note that if the variable X 3 is nonexistent and if X x = Y and X 2 = X, then equation (5) reduces to 
equation (25) of Chapter 14. 
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STANDARD ERROR OF ESTIMATE 

By an obvious generalization of equation (8) of Chapter 14, we can define the standard error of 
estimate of X x on X 2 and X 3 by 

where X l est indicates the estimated values of X x as calculated from the regression equations (7) or (5). 

In terms of the correlation coefficients r n , r u , and r 23 , the standard error of estimate can also be 
computed from the result 



1-^3 



r\ 3 + 2r l2 r l3 r 23 



The sampling interpretation of the standard error of estimate for two variables as given on page 313 
for the case when N is large can be extended to three dimensions by replacing the lines parallel to the 
regression line with planes parallel to the regressio n plane. A better estimate of the population standard 
error of estimate is given by s l23 = y / N/(N - 3).v 12 3- 



COEFFICIENT OF MULTIPLE CORRELATION 

The coefficient of multiple correlation is defined by an extension of equation (12) or (14) of Chapter 
14. In the case of two independent variables, for example, the coefficient of multiple correlation is given by 




where s x is the standard deviation of the variable X x and 5^23 is given by equation (6) or (7). The quantity 
iff.23 is called the coefficient of multiple determination. 

When a linear regression equation is used, the coefficient of multiple correlation is called the coeffi- 
cient of linear multiple correlation. Unless otherwise specified, whenever we refer to multiple correlation, 
we shall imply linear multiple correlation. 

In terms of r l2 , r l3 , and r 23 , equation (8) can also be written 

A coefficient of multiple correlation, such as -R1.23, lies between and 1. The closer it is to 1, the 
better is the linear relationship between the variables. The closer it is to 0, the worse is the linear 
relationship. If the coefficient of multiple correlation is 1, the correlation is called perfect. Although a 
correlation coefficient of indicates no linear relationship between the variables, it is possible that a 
nonlinear relationship may exist. 



CHANGE OF DEPENDENT VARIABLE 

The above results hold when X { is considered the dependent variable. However, if we want to 
consider X 3 (for example) to be the dependent variable instead of X u we would only have to replace 
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the subscripts 1 with 3, and 3 with 1, in the formulas already obtained. For example, the regression 
equation of X 3 on X x and X 2 would be 

^3 _ / ^23 ~ r U r \l \ *2 , ( r jz^ffnfn\ *i_ / m x 

s,-{ \-r\ 2 )s 2 + { l-r\ 2 )s x {W) 

as obtained from equation (J), using the results r 32 = r 23 , r 3X = r X3 , and r 2X = r l2 . 



GENERALIZATIONS TO MORE THAN THREE VARIABLES 

These are obtained by analogy with the above results. For example, the linear regression equations 
of X { on X 2 , X 3 , and X 4 can be written 

X\ = &1.234 + ^12.34^2 + ^13.24^3 + ^14.23^4 (H) 

and represents a hyperplane in four-dimensional space. By formally multiplying both sides of equation 
(77) by 1, X 2 , X 3 , and X 4 successively and then summing on both sides, we obtain the normal equations 
for determining ft 1.234, ftn.34, fti3.24> and Z> 14 23; substituting these in equation (77) then gives us the least- 
squares regression equation ofX x on X 2 , X 3 , and X 4 . This least-squares regression equation can be written 
in a form similar to that of equation (5). (See Problem 15.41.) 



PARTIAL CORRELATION 

It is often important to measure the correlation between a dependent variable and one particular 
independent variable when all other variables involved are kept constant; that is, when the effects of all 
other variables are removed (often indicated by the phrase "other things being equal"). This can be 
obtained by defining a coefficient of partial correlation, as in equation (72) of Chapter 14, except that we 
must consider the explained and unexplained variations that arise both with and without the particular 
independent variable. 

If we denote by r X23 the coefficient of partial correlation between X x and X 2 keeping X 3 constant, 
we find that 

^2.3 = , (12) 

V /(l-^)(l_,2 3) 

Similarly, if r 12 . 34 is the coefficient of partial correlation between X x and X 2 keeping X 3 and X 4 constant, 
then 

(75) 



V^ 1 - ^.4)(1 -rLO v /(l-r? 43 )(l- 



These results are useful since by means of them any partial correlation coefficient can ultimately be made 
to depend on the correlation coefficients r l2 , r 23 , etc. (i.e., the zero-order correlation coefficients). 

In the case of two variables, X and Y, if the two regression lines have equations Y = a + a x X and 
X = b + b x Y, we have seen that r 2 = a x b x (see Problem 14.22). This result can be generalized. 
For example, if 

X\ = &1.234 + ^12.34^2 + ftl3.24^3 + ^14.23^4 (14) 
and X A = ft 4 123 + ft 4 i.23^1 + ft42. 13-^2 + ^43.12^3 (15) 
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are linear regression equations of X x on X 2 , X 3 , and X 4 and of X 4 on X u X 2 , and X 3 , respectively, then 

^4.23 = ^14.23^41.23 (^) 

(see Problem 15.18). This can be taken as the starting point for a definition of linear partial correlation 
coefficients. 



RELATIONSHIPS BETWEEN MULTIPLE AND PARTIAL CORRELATION COEFFICIENTS 

Interesting results connecting the multiple correlation coefficients can be found. For example, 
we find that 

1 -^T.23 = (1 -r? 2 )(l -r? 3 . 2 ) (17) 
1 " * 1.234 = ( 1 - r\ 2 ) ( 1 - r] 32 ) ( 1 - r\ 423 ) (18) 
Generalizations of these results are easily made. 

NONLINEAR MULTIPLE REGRESSION 

The above results for linear multiple regression can be extended to nonlinear multiple regression. 
Coefficients of multiple and partial correlation can then be defined by methods similar to those given 
above. 



Solved Problems 

REGRESSION EQUATIONS INVOLVING THREE VARIABLES 

15.1 Using an appropriate subscript notation, write the regression equations of (a), X 2 on X x and X 3 \ 

(b) X 3 on X x , X 2 , and X 4 ; and (c) X 5 on X u X 2 , X 3 , and X 4 . 

SOLUTION 

(a) X 2 = b 2A 3 + b 2X . 3 X x + b 23A X 3 

(b) X 3 = b 3A24 + b 3L24 X x + b 32A4 X 2 + b 34A2 X 4 

(C) X 5 = b 5A234 + ^5 1.234^1 + ^2.134^2 + ^53.124^3 + b i4A23 X 4 

15.2 Write the normal equations corresponding to the regression equations (a) X 3 = b 3 l2 + 
bi\.iX\ + b 32A X 2 and (b) X x = b X234 + b X234 X 2 + b X3 24 X 3 + b l423 X 4 . 

SOLUTION 

(a) Multiply the equation successively by 1, X x , and X 2 , and sum on both sides. The normal equations are 

E^3 =b 3A2 N +b 3X . 2 X x + 632.1 £^2 

£ x x x 3 = b 3A2 £ x x + b 3L2 £ x\ + b 32A £ x x x 2 
£ x 2 x 3 = 63.12 E x 2 + b 3L2 E x x x 2 + b 32A £ x\ 
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(6) Multiply the equation successively by 1, X 2 , X 3 , and X 4 , and sum on both sides. The normal 
equations are 

£x, =b L234 N +b l234 j:x 2 + 6,3.24 E ^3 +b l423 J2X 4 
Y / x l x 2 = b l234 Y,x 2 + b l2M J2 x\ + b U24 y, x 2 x 3 + b l423 J2 x 2 x 4 

EVl= h 134 E ^3 + b nM E X 2 X 3 + 6,3.24 E 4 + ^14.23 E X 3 X 4 
E X X X 4 = b ]234 J2 X 4 + &12.34 E *2*4 + ^13.24 E ^3^4 + ^14.23 E *4 

Note that these are not derivations of the normal equations, but only formal means for 
remembering them. 

The number of normal equations is equal to the number of unknown constants. 

15.3 Table 15.1 shows the weights X, to the nearest pound (lb), the heights X 2 to the nearest inch (in), 
and the ages X 3 to the nearest year of 12 boys. 

(a) Find the least-squares regression equation of X l on X 2 and X 3 . 

(b) Determine the estimated values of X x from the given values of X 2 and X 3 . 

(c) Estimate the weight of a boy who is 9 years old and 54 in tall. 

(d) Find the least-squares regression equation using EXCEL, MINITAB, SPSS and 
STATISTIX. 



Table 15.1 



Weight (X,) 


64 


71 


53 


67 


55 


58 


77 


57 


56 


51 


76 


68 


Height (X 2 ) 


57 


59 


49 


62 


51 


50 


55 


48 


52 


42 


61 


57 


Age (X 3 ) 


8 


10 


6 


11 


8 


7 


10 


9 


10 


6 


12 


9 



SOLUTION 

(a) The linear regression equation of X, on X 2 and X 3 can be written 
X\ = 6, 23 + b n3 X 2 + b l32 X 3 
The normal equations of the least-squares regression equation are 

E x \ =b l23 N +b u , 3 J2X 2 +b l32 J2X 3 

J2 x ] x 2 = b l23 j2 x 2 + b l23 j2 xi + b 13 . 2 E x 2 x 3 (19) 
E x,x 3 = b l 23 Y,x 3 + b i23 J2 x 2 x 3 + b U2 J2 x\ 

The work involved in computing the sums can be arranged as in Table 15.2. (Although the column 
headed xj is not needed at present, it has been added for future reference.) Using Table 15.2, the 
normal equations (79) become 

m l23 + 6436,2 3 + 1066,3.2= 753 
6436,. 23 + 34,8436,2.3 + 5,7796 13 . 2 = 40,830 {20) 
1066,23 + 5,7796,2.3 + 9766, 3 . 2 = 6,796 
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Solving, bi.23 = 3.6512, £> 123 = 0.8546, and b n2 = 1.5063, and the required regression equation is 

Xi = 3.6512 + 0.8546X 2 + 1.5063X 3 or X, = 3.65 + 0.855X 2 + 1.506X 3 (21) 



Table 15.2 





X 2 


^3 


Xi 


x\ 


A 


X X X 2 


x x x 3 


X 2 X } 




















71 


59 


10 


5041 


3481 


100 


4189 


710 


590 


53 


49 


6 


2809 


2401 


36 


2597 


318 


294 


67 


62 


11 


4489 


3844 


121 


4154 


737 


682 




















58 


50 


7 


3364 


2500 


49 


2900 


406 


350 


77 


55 


10 


5929 


3025 


100 


4235 


770 


550 


57 


48 


9 


3249 


2304 


81 


2736 


513 


432 


56 


52 


10 


3136 


2704 


100 


2912 


560 


520 


51 


42 


6 


2601 


1764 


36 


2142 


306 


252 


76 


61 


12 


5776 


3721 


144 


4636 


912 


732 


68 


57 


9 


4624 


3249 


81 


3876 


612 


513 


E x \ 


E^2 


E^3 


E *\ 


E4 


E4 


E^2 


E^1^3 


E*2^3 


= 753 


= 643 


= 106 


= 48,139 


= 34,843 


= 976 


= 40,830 


= 6796 


= 5779 



For another method, which avoids solving simultaneous equations, see Problem 15.6. 
(b) Using the regression equation (21), we obtain the estimated values of X\, denoted by X 1;est , by sub- 
stituting the corresponding values of X 2 and X 3 . For example, substituting X 2 = 57 and X 3 = 8 in (21 ), 
we find X li6St = 64.414. 





The other estimated values of X x are obtained similarly. They are given i 
the sample values of X x . 


n Table 15.3 together with 




Table 15.3 




J+est 


64.414 69.136 54.564 73.206 59.286 56.925 65.717 58.229 63.153 


48.582 73.857 65.920 




64 71 53 67 55 58 77 57 56 


51 76 68 



(c) Putting X 2 = 54 and X 3 = 9 in equation (21), the estimated weight is X x cst = 63.356, or about 
63 lb. 

(d) Part of the EXCEL output is shown in Fig. 15-1. The pull-down Tools — » Data analysis — » Regression 

is used to give the output. The coefficients b t 2 3 = 3.65 1 2, b( 2 3 = 0.8546, and b 13 2 = 1.5063 are shown 
in bold in the output. 

Part of the MINITAB output is the regression equation is XI = 3.7 + 0.855X2 + 1.51X3. The 
pull-down Stat — » Regression — » Regression is used after the data are entered into ci - C3. 

A portion of the SPSS output is shown in Fig. 15-2. The pull-down analyse — ► Regression — ► Linear 
gives the output. The coefficients b\ , 23 = 3.651, b U3 = 0.855, and b U 2 = 1.506 are shown in the output 
under the column titled Unstandardized Coefficients. 

Part of the STATISTIX output is shown in Fig. 15-3. The pull-down Statistics -> Linear models -> 
Linear Regression gives the output. 
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71 


59 


10 


Regressior 


Statistics 


53 


49 
62 


6 


Multiple R 


0.841757 


67 


11 


R Square 


0.708554 


55 


51 


8 


Adjusted R Squar 


e 0.643789 


58 


50 


7 


Standard Error 


5.363215 
12 


77 


55 


10 


Observations 


57 


48 


9 






56 


52 


10 


ANOVA 




51 


42 


6 




df 


76 


61 


12 


Regression 


2 


68 


57 


9 


Residual 
Total 


9 



Coefficients 
Intercept 3.651216 
X2 0.85461 
X3 1.506332 
Fig. 15-1 EXCEL output for Problem 153(d). 



Coefficients 3 



Model 


Unstandardized 
Coefficients 


Standardized 
Coefficients 


t 


Sig. 


B 


Std. Error 


Beta 


1 (Constant) 


3.651 


16.168 




.226 


.826 


X2 


.855 


.452 


.565 


1.892 


.091 


X3 


1.506 


1.414 


.318 


1.065 


.315 



Statistix 8.0 






Unweighted Least Squares Linear Regressio 


riof X1 




Predictor 






Variables Coefficient Std Error T 


P 


VIF 


Constant 3.65122 16.1678 0.23 


0.8264 




X2 0.85461 0.45166 1.89 


0.0910 


2.8 


X3 1.50633 1.41427 1.07 


0.3146 


2.8 



Fig. 15-3 STATISTIX output for Problem 15.3(d). 



The software solutions are the same as the normal equations solution. 

15.4 Calculate the standard deviations (a) s u (b) s 2 , and (c) s 3 for the data of Problem 15.3. 
SOLUTION 

(a) The quantity Si is the standard deviation of the variable X x . Then, using Table 15.2 of Problem 15.3 
and the methods of Chapter 4, we find 
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Compute (a) r 12 , (b) r, 3 , and (c) r 23 for the data of Problem 15.3. Compute the three correlations 
by using EXCEL, MINITAB, SPSS, and STATISTIX. 



(a) The quantity r 12 is the linear correlation coefficient between the variables X x and X 2 , ignoring the 
variable Z 3 . Then, using the methods of Chapter 14, we have 



NY.x l x 2 -{Y,x l ){Y l x 2 ) 
y/WL A - (£ ^i) 2 M£ x\ - (E x 2 f] 

(12)(40,830) - (753)(643) 

A /[(12)(48,139) - (753) 2 ][(12)(34,843) - (643) 2 ] 



(b) and (c) Using corresponding formulas, we obtain r n = 0.7698, or 0.77, and r 23 = 0.7984, or 0.80. 
(d) Using EXCEL, we have: 



0.819645 =CORREL(A2:A13,B2:B13) 
0.769817 =CORREL(A2:A13,C2:C13) 
0.798407 =CORREL(B2:B13,C2:C13) 



We find that r i2 is in Dl, r 13 is in D2, and r 23 is in D3. The functions we used to produce the 
correlations in EXCEL are shown in El, E2, and E3. 

Using the MINITAB pull-down Stat — > Basic Statistics — * Correlation gives the following output. 



Correlations: XI, X2, X3 

XI X2 
X2 0.820 
0.001 

X3 0.770 0.798 

0.003 0.002 

Cell Contents : Pearson correlation 
P-Value 



The correlation r 12 is at the intersection of XI and X2 and is seen to be 0.820. The value below it is 
0.001. It is the /j-value for testing no population correlation between XI and X2. Because this ^-value is 
less than 0.05, we would reject the null hypothesis of no population correlation between height (X2) and 
weight (XI). All the other correlations and their ^-values are read similarly. 
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Using the SPSS pull-down Analyze —> Correlate — > Bivariate gives the following output which is read 
similar to MINITAB. 



Correlations 





X1 


X2 


X3 


X1 


Pearson Correlation 


1 


.820** 


.770** 




Sig. (2-tailed) 




.001 


.003 




N 


12 


12 


12 


X2 


Pearson Correlation 


.820** 


1 


.798** 




Sig. (2-tailed) 


.001 




.002 




N 


12 


12 


12 


X3 


Pearson Correlation 


.770** 


.798** 


1 




Sig. (2-tailed) 


.003 


.002 






N 


12 


12 


12 



"Correlation is significant at the 0.01 level (2-tailed). 



Using the STATISTIX pull-down Statistics Linear models -» Correlation gives the following 
output which is similar to the other software outputs. 

Statistix 8 . 
Correlations (Pearson) 

XI X2 

X2 0.8196 
P-VALUE 0.0011 
X3 0.7698 0.7984 

0.0034 0.0018 

Once again, we see that the software saves us a great deal of time doing the computations for us. 



15.6 Work Problem 15.3(a) by using equation (J) of this chapter and the results of Problems 15.4 
and 15.5. 

SOLUTION 

The regression equation of X x on X 2 and X 3 is, on multiplying both sides of equation (5) by 

*-(™)0- + (^r)©- 

where x x = X X — X u x 2 = X 2 -X 2 , and x 3 =X 3 -X 3 . Using the results of Problems 15.4 and 15.5, 
equation (22) becomes 

x, = 0.8546x 2 + 1.5063x 3 

Since X, = = — = 62.750 X 2 =±^= 53.583 and X 3 = 8.833 

(from Table 15.2 of Problem 15.3), the required equation can be written 

X x - 62.750 = 0.8546(X 2 - 53.583) + 1.506(X 3 - 8.833) 
agreeing with the result of Problem 15.3(a). 
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15.7 For the data of Problem 15.3, determine (a) the average increase in weight per inch of increase 
in height for boys of the same age and (b) the average increase in weight per year for boys having 
the same height. 
SOLUTION 

From the regression equation obtained in Problem 15.3(a) or 15.6 we see that the answer to (a) is 
0.8546, or about 0.9 lb, and that the answer to (b) is 1.5063, or about 1.5 lb. 



15.8 Show that equations (J) and (4) of this chapter follow from equations (7) and (2). 
SOLUTION 

From the first of equations (2), on dividing both sides by N, we have 

X\ = &1.23 + 6,2.3^2 + ^13.2^3 (25) 

Subtracting equation (23) from equation (7) gives 

*i-Xi= hiA^i - Xi) + 613.2(^3 - x 3 ) 

or X, = b U3 x 2 + 6,3.2*3 (24) 

which is equation (3). 

Let Xi = x, + X u X 2 = x 2 + X 2 , and X 3 = x 3 + X 3 in the second and third of equations (2). Then after 
some algebraic simplifications, using the results Yl x \ = Yl x i ~ Y x 3 = 0, they become 

£xi*2 =^12.3 £*2 + 6l3.2 Ex2X 3 +NX 2 [bi.23+ b n . 3 X 2 +^3.2X3-^] (25) 

£*i*3 = 612.3 £*2*3 + 613.2 £*3 +NX 3 [bi2 3 + bi 23 X 2 + bi 32 X 3 -Xi\ (26) 

which reduce to equations (4) since the quantities in brackets on the right-hand sides of equations (25) and 
(26) are zero because of equation (7). 



15.9 Establish equation (5), repeated here: 

xj_ = f ri 2 -r l3 r 23 \x 2 f r u -r n r 23 \ x 3 

*i V i-'i A V 1-^3 M 

SOLUTION 

From equations (25) and (26) 

&12.3 Y x l +bi 3 . 2 T,x 2 x 3 = J2 x \ x 2 
bu.3T,x 2 x 3 + bi 32 J2x 2 3 = Y x x x 3 

Since ^ = ^W and ' ? ' = ^\T 

J2 x\ = Ns\ and £ x\ = Nsj. Since 

r Y X 2 X 3 _ Ex 2 x 3 

£ x 2 x 3 = Ns 2 s 3 r 23 . Similarly, ^ X[X 2 = Ns\S 2 r n and £ X|X 3 = Nsis 3 r X3 . 



(27) 
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Substituting in (27) and simplifying, we find 

k\23 s 2 r 23 + ^13.2- s 3 = s \ r U 

Solving equations (28) simultaneously, we have 



^12.3 = 



^12 ~ r 13^23 ^ ( t 



Substituting these in the equation x x = b U3 x 2 + b n2 x 3 [equation (24)] and dividing by s\ yields the 
required result. 



STANDARD ERROR OF ESTIMATE 

15.10 Compute the standard error of estimate of X { on X 2 and X 3 for the data of Problem 15.3. 
SOLUTION 

From Table 15.3 of Problem 15.3 we have 



■ / £(*! -*l,ct) 2 

'^ = V N 

_ y /(64 - 64.414) 2 + (71 - 69.136) 2 + ■ ■ ■ + (68 - 65.920) 2 _ ^ 



.6447 or 4.61b 



The population standard error of estimate is estimated by s L2 i = \/N/(N - 3)s 12 i = 5-3 lb in this c; 
15.11 To obtain the result of Problem 15.10, use 



^12.3 = x l 



SOLUTION 

From Problems 15.4(a) and 15.5 we have 



1 — r 12 — r B ~ r 23 + ^ r I2 r 13^23 



1-^ 



o 1 - (0.8196) 2 - (0.7698r - (0.7984)'' + 2(0.8196)(0.7698)(0.7984) „ rll 

s\ 23 = 8.6035i / , = 4.6 lb 

V 1 - (0.7984) 2 

Note that by the method of this problem the standard error of estimate can be found without using the 
regression equation. 



COEFFICIENT OF MULTIPLE CORRELATION 

15.12 Compute the coefficient of linear multiple correlation of X l on X 2 and X 3 for the data of Problem 
15.3. Refer to the M1NITAB output in the solution of Problem 15.3 to determine the coefficient 
of linear multiple correlation. 
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SOLUTION 

First method 

From the results of Problems 15.4(a) and 15.10 we have 

* ia = J^=,/l-(^ = 0.8418 
V sf V (8.6035) 2 



Second method 

From the results of Problem 15.5 we have 



R = l r 2 u + r 2 u -2r l2ri ^ _ / (0.8196) 2 + (0.7698) 2 - 2(0.8196)(0.7698)(0.7984) _ Q g<|ig 
123 V 1 - r 23 V 1 - (0.7984) 2 

Note that the coefficient of multiple correlation, R t 23 , is larger than either of the coefficients r 12 or r 13 
(see Problem 15.5). This is always true and is in fact to be expected, since by taking into account additional 
relevant independent variables we should arrive at a better relationship between the variables. 

The following part of the MINITAB output given in the solution of Problem 15.3, R-Sq = 7 . 9%, 
gives the square of the coefficient of linear multiple correlation. The coefficient of linear multiple correlation 
is the square rooot of this value. That is R r 23 = V0.709 = 0.842. 



15.13 Compute the coefficient of multiple determination of X\ on X 2 and X 3 for the data of Problem 
15.3. Refer to the MINITAB output in the solution of Problem 15.3 to determine the coefficient 
of multiple determination. 

SOLUTION 

The coefficient of multiple determination of X\ on X 2 and X 3 is 
R 2 l2i = (0.8418) 2 = 0.7086 

using Problem 15.12. Thus about 71% of the total variation in X i is explained by using the regression 

The coefficient of multiple determination is read directly from the MINITAB output given in the 
solution of Problem 15.3 as R-Sq = 70.9%. 



15.14 For the data of Problem 15.3, calculate (a) 7? 2 .i3 and (b) R 3 U an d compare their values with the 
value of Dis- 
solution 



(a) R 2 .n = ^ 
{b) R 3 . l2 = \ 



/(0.8196) 2 + (0.7984)- - 2(().8196)(0.7698)(0.7984) _ 



/(0.7698) 2 + (0.7984) 2 - 2(0.8196)(0.7698)(0.7984) _ 



This problem illustrates the fact that, in general, R 2 .n, Ryu, and R L23 are not necessarily equal, as seen 
by comparison with Problem 15.12. 
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15.15 If R L23 = 1, prove that (a) R 2A3 = 1 and (b) R 3A 
SOLUTION 



R 23 = j r 2 ]2 + r 2 l3 -2r l2 r n r 23 

V 1 ~ f 23 

R 2 13 = y ^' 2 + ^ ~ frttWa 



(50) 

(a) In equation (29), setting iij 23 = 1 and squaring both sides, rf 2 + rf 3 — 2r 12 r 13 /- 23 = 1 — r] v Then 

r 2 + r 2 _ 2r r r - 1 - r 2 or r ' 2 + ^ ~ 2 '' 12 '" 13 '' 23 - 1 
1 '1: 

That is, R 2A3 = 1 or i? 213 = 1, since the coefficient of multiple correlation is considered nonnegative. 

(b) -R3.12 = 1 follows from part (a) by interchanging subscripts 2 and 3 in the result R 2 .13 = 1. 

15.16 If 7*1.23 = °> does it necessarily follow that R 2A3 = 0? 
SOLUTION 

From equation (29), i?[ 23 = if and only if 

r 2 n + r\ 3 - 2r 12 r l3 r 23 =0 or 2r 12 r 13 r 23 = r] 2 + r\ 3 
Then from equation (50) we have 

/ r 2 2 + ,2 3 _ ( ,2 2 + ,2 3) G^T 

213 "V r ^ " V T3 ^7 

which is not necessarily zero. 



PARTIAL CORRELATION 



15.17 For the data of Problem 15.3, compute the coefficients of linear partial correlation r l2 . 3 , r 13 2 , 
and r 23 .\. Also use STATISTIX to do these same computations. 
SOLUTION 

' I - i — '-13.2 = YU f '2' jj == ,. 23 ] = '23 r \2 r U == 

V / (l-'n)(l-d 3 ) ^(1 -r? 2 )(l -r| 3 ) ^(l-,f 2 )(l-r 2 3 ) 

Using the results of Problem 15.5, we find that r 12 3 = 0.5334, r U2 = 0.3346, and r 23A = 0.4580. It follows that 
for boys of the same age, the correlation coefficient between weight and height is 0.53; for boys of the same 
height, the correlation coefficient between weight and age is only 0.33. Since these results are based on a small 
sample of only 12 boys, they are of course not as reliable as those which would be obtained from a larger sample. 

Use the pull-down Statistics — ► Linear models — » Partial Correlations to obtain the dialog box in Fig. 15-4. 
Fill it in as shown. It is requesting r 123 . The following output is obtained. 

Statistix 8 . 

Partial Correlations with XI 
Controlled for X3 



X2 



0.5335 
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Dependent Viable 



Help | 



Lurrdahcfi Variable.; 



Weight Variable [Opt] 
^ | J^J | |7 Fit Constant 



Fig. 15-4 STATISTIX dialog box for Problem 15.17. 
Similarly, STATISTIX may be used to find the other two partial correlations. 

15.18 If X x = b { 23 + b l23 X 2 + ^13.2^3 an d X 3 = b 3l2 + b 32l X 2 + b 3X2 X x are the regression equations of 
X x on X 2 and X 3 and of X 3 on X 2 and Z 1; respectively, prove that r 2 n2 = ^13.2^31.2- 
SOLUTION 

The regression equation of X x on X 2 and X 3 can be written [see equation (5) of this chapter] 

- 0^)©«-^t^)(£)<''-*> ,J " 

The regression equation of X 3 on X 2 and ^ can be written [see equation (10)] 

(™) + ) ®<*. - («) 

From equations (57) and (32) the coefficients of X 3 and are, respectively, 

, ( ''13 ~ ryfyA ( S 3 \ 



( rn - r i2 r 2 i \ ( sA 



, , {rn ~ r l2 r 23 ) 2 2 

ftl "^ = (l-i3)(l-^ = ri " 



15.19 If r u .i = 0, prove that 



I - r 2 \ - r 2 

(a) r m = r 13 W- f 1 0) r 23 ., = r 23 W- j 1 

V 1 ~~ r \2 V 1 _ '"12 



"^(l-rf 3 )(l-^)" 
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we have r l2 = r l3 r 23 . 

_ r n -r n r 2i _ r n - (r n r 23 )r 23 _ r l3 (\ - r 23 ) 



(a) 



^(1-^(1-^) ^(l-^Kl-i^) ^(l-r] 2 )(l- r j 3 ) 
(6) Interchange the subscripts 1 and 2 in the result of part (a). 



MULTIPLE AND PARTIAL CORRELATION INVOLVING FOUR OR MORE VARIABLES 

15.20 A college entrance examination consisted of three tests: in mathematics, English, and general 
knowledge. To test the ability of the examination to predict performance in a statistics course, 
data concerning a sample of 200 students were gathered and analyzed. Letting 

X y = grade in statistics course X 3 = score on English test 

X 2 — score on mathematics test X 4 = score on general knowledge test 

the following calculations were obtained: 

X x = 75 s x = 10 X 2 = 24 s 2 = 5 

X 3 = 15 s 3 = 3 X 4 = 36 s 4 = 6 
r n = 0.90 r X3 = 0.75 r x4 = 0.80 r 23 = 0.70 r 24 = 0.70 r 34 = 0.85 
Find the least-squares regression equation of X x on X 2 , X 3 , and X 4 . 
SOLUTION 

Generalizing the result of Problem 15.8, we can write the least-squares regression equation of X x on X 2 , 
X). and X 4 in the form 

X\ = 6,2.34*2 + b u24 X 3 + 6,4.23*4 {33) 

where 612.34, 613.24, an d 6,423 can be obtained from the normal equations 

E X 1 X 2 = 6l2.34 E*2 + 6,3.24 E *2*3 + 6,4.23 E *2*4 

E*l*3 =612.34 E*2*3 + 613.24 E*3 + 6,4.23 E *3*4 {34) 
E X\X 4 = 6,2.34 E *2*4 + 6,3.24 E *3*4 + ^14.23 E X l 

and where x, = X x - X x , x 2 = X 2 - X 2 , x 3 = X 3 - X 3 , and x 4 = X 4 - X 4 . 
From the given data, we find 





~Ns 2 2 


= 5000 




= Ns x s 2 r l2 = 9000 


£*2*3 


= Ns r s 3 r 23 


= 2100 


E*3 


- 7V.V3 


= 1800 




= Ns x s 3 r X3 = 4500 


E*2*4 


= Ns 2 s 4 r 24 


= 4200 


E4 


-Ns 4 


= 7200 


E*l*4 


= Ns x s 4 r X4 = 9600 


£*3*4 


= Ns 3 s 4 r 34 


= 3060 



Putting these results into equations (34) and solving, we obtain 

612.34 = L3333 6,3.24 = 0.0000 6,4.23 = 0.5556 (35) 
which, when substituted in equation {33). yield the required regression equation 

x, = 1.3333^2 + O.OOOOxj + 0.5556x4 
or X, - 75 = 1.3333(X,-24)+0.5556(X 4 -27) (36) 

or X, = 22.9999 + 1.3333Z 2 + 0.5556Z 4 
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An exact solution of equations (34) yields b\ 1M = f, 613.24 = 0, and b H2 3 = §, so that the regression 
equation can also be written 

X x = 23 + fX 2 +fX 4 (37) 

It is interesting to note that the regression equation does not involve the score in English, namely, X 3 . 
This does not mean that one's knowledge of English has no bearing on proficiency in statistics. Instead, it 
means that the need for English, insofar as prediction of the statistics grade is concerned, is amply evidenced 
by the scores achieved on the other tests. 



15.21 Two students taking the college entrance examination of Problem 15.20 receive respective scores 
of (a) 30 in mathematics, 18 in English, and 32 in general knowledge; and (b) 18 in mathematics, 
20 in English, and 36 in general knowledge. What would be their predicted grades in statistics? 

SOLUTION 

(a) Substituting X 2 = 30, = 18, and X 4 = 32 in equation (37), the predicted grade in statistics is 
X x = 81. 

(b) Proceeding as in part (a) with X 2 = 18, X 3 = 20, and X 4 = 36, we find X x = 67. 



15.22 For the data of Problem 15.20, find the partial correlation coefficients (a) (b) r 

(C) ri4.23- 



SOLUTION 

(a) and (b) r nA = 



^(l-^Ki-'i) ^(i-^Ki"^) f23 ' 4 ^/O-'iXi-'i) 



Substituting the values from Problem 15.20, we obtain r UA = 0.7935, r l3A = 0.2215, and ;- 23 4 = 0.2791. 
Thus 



(c) 



"aA 1 - ^.4)0 -4.4) ' ' 4.4)0 -4.4) 

_ ''14 — r H f 34 f _ r \2 - r \l r 2i r _ r 24 ~ r 23 r 34 

^(l-ri 3 )(l -r§ 4 ) ^(l-rf 3 )(l-^ 3 ) ^ ^/(1 - r^)(l - r| 4 ) 

n 15.20, we obtain r U 3 = 

_ r 14.3 - r U.3 r 243 . 

V^-^.sKi-'i.s)" 



Substituting the values from Problem 15.20, we obtain r U3 = 0.4664, r 12 3 = 0.7939, and ;- 24 3 = 0.2791. 
Thus 



15.23 Interpret the partial correlation coefficients (a) r l2A , (b) r 134 , (c) r 1234 , (d) r l43 , and (e) r 1423 
obtained in Problem 15.22. 

SOLUTION 

(a) r 124 = 0.7935 represents the (linear) correlation coefficient between statistics grades and mathematics 
scores for students having the same general knowledge scores. In obtaining this coefficient, scores in 
English (as well as other factors that have not been taken into account) are not considered, as is 
evidenced by the fact that the subscript 3 is omitted. 
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(b) '"i3.4 = 0.2215 represents the correlation coefficient between statistics grades and English scores 
for students having the same general knowledge scores. Here, scores in mathematics have not been 
considered. 

i c ) r i2.34 = 0.7814 represents the correlation coefficient between statistics grades and mathematics scores 
for students having both the same English scores and general knowledge scores. 

id.) J"i4.3 = 0.4664 represents the correlation coefficient between statistics grades and general knowledge 
scores for students having the same English scores. 

(e) '"14.23 = 0.4193 represents the correlation coefficient between statistics grades and general knowledge 
scores for students having both the same mathematics scores and English scores. 



15.24 (a) For the data of Problem 15.20, show that 
ru.4 ~ r l}A r 2 3A 



^3.4X1-4.4) \l^-r\,,){l-r 2 2A3 ) 



(38) 



(b) Explain the significance of the equality in part (a). 
SOLUTION 

(a) The left-hand side of equation (38) is evaluated in Problem 15.22(a) yielding the result 0.7814. To 
evaluate the right-hand side of equation (38), use the results of Problem 15.22(c); again, the result is 
0.7814. Thus the equality holds in this special case. It can be shown by direct algebraic processes that 
the equality holds in general. 

(b) The left-hand side of equation (38) is '"12.34, and the right-hand side is r^.43. Since r 12 .34 is the correlation 
between variables X x and X 2 keeping X 3 and X 4 constant, while r 12 . 4 3 is the correlation between X x and 
X 2 keeping X 4 and X 3 constant, it is at once evident why the equality should hold. 

15.25 For the data of Problem 15.20, find (a) the multiple correlation coefficient 1.234 and (b) the 
standard error of estimate S1.234. 



1 - tfl.234 = (1 ~ r 2 n )(\ - r? 3 . 2 )(l - r? 4 . 23 ) or * L234 = 0.931 

since r X2 = 0.90 from Problem 15.20, ri 4 . 23 = 0.4193 from Problem 15.22(c), and 
ri3 2 = r n -r n r 2i = 0.75 - (0.90)(0.70) = Q ^ 
^/(l-rf 2 )(l-ry ^/[l - (0.90) 2 ][1 - (0.70) 2 )] 

Another method 

Interchanging subscripts 2 and 4 in the first equation yields 

1 - tfi.234 = (1 - ri4)0 - '"?3.4)(1 - '"12.34) or R L234 = 0.93 1 
where the results of Problem 15.22(a) are used directly. 



*1.234 = y I 2M Or *1. 234 = SlJl~ ^.234 = 10^ 1 - (0.9310) 2 = 3.6 
Compare with equation (8) of this chapter. 
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REGRESSION EQUATIONS INVOLVING THREE VARIABLES 

15.26 Using an appropriate subscript notation, write the regression equations of (a) X 3 on X\ and X 2 and (b) Xt on 

X u X 2 , X 3 , and X 5 . 

15.27 Write the normal equations corresponding to the regression equations of (a) X 2 on X x and X 3 and (b) X 5 on 

X u X 2 , X 3 , and X 4 . 

15.28 Table 15.4 shows the corresponding values of three variables: X u X 2 , and X 3 . 

(a) Find the least-squares regression equation of X 3 on X x and X 2 . 

(b) Estimate X 3 when X, = 10 and X 2 = 6. 

Table 15.4 



3 



14 



mathematics wished to determine the relationship of grades on a final examination 
to grades on two quizzes given during the semester. Calling X\, X 2 , and X 3 the grades of a student on 
the first quiz, second quiz, and final examination, respectively, he made the following computations for a 
total of 120 students: 



r n = 0.60 r 13 



= 0.80 
= 0.70 



r 23 = 0.65 



(a) Find the least-squares regression equation of X 3 on X x and X 2 . 

(b) Estimate the final grades of two students whose respective scores on the two quizzes were (1) 9 and 7 
and (2) 4 and 8. 

15.30 The data in Table 15.5 give the price in thousands (X x ), the number of bedrooms (X 2 ), and the number of 
baths (X 3 ) for 10 houses. Use the normal equations to find the least-squares regression equation of X x on X 2 
and Z 3 . Use EXCEL, MINITAB, SAS, SPSS, and STATISTIX to find the least-squares regression equation 
of X, on X 2 and X 3 . Use the least-squares regression equation of X x on X 2 and X 3 to estimate the price of a 
home having 5 bedrooms and 4 baths. 

Table 15.5 
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STANDARD ERROR OF ESTIMATE 

15.31 For the data of Problem 15.28, find the standard error of estimate of X 3 on X x and X 2 . 

15.32 For the data of Problem 15.29, find the standard error of estimate of (a) X 3 on X x and X 2 and (b) X x on 
X 2 and X 3 . 

COEFFICIENT OF MULTIPLE CORRELATION 

15.33 For the data of Problem 15.28, compute the coefficient of linear multiple correlation of X 3 on X { and X 2 . 

15.34 For the data of Problem 15.29, compute (a) R 3A2 , (b) R L23 , and (c) R 2 n . 

15.35 (a) If r, 2 = r l3 = r 13 = r ^ 1, show that 




(b) Discuss the case r = 1 . 

15.36 If R\ 23 = 0, prove that \r 23 \ > \r u \ and |r 23 | > |r 13 | and interpret. 

PARTIAL CORRELATION 

15.37 For the data of Problem 15.28, compute the coefficients of linear partial correlation r 12 . 3 , ^13.2, and r 23A . 
Also use STATISTIX to do these same computations. 

15.38 Work Problem 15.37 for the data of Problem 15.29. 

15.39 If r 12 = r 13 = r 23 = r / 1, show that r n 3 = r U 2 = r 23A =r/(l+ r). Discuss the case r=\. 

15.40 If r 12 3 = 1, show that (a) |r 13 2 | = 1, (b) \r 23A \ = 1, (c) i? L23 = 1, and (d) s L23 = 0. 

MULTIPLE AND PARTIAL CORRELATION INVOLVING FOUR OR MORE VARIABLES 

15.41 Show that the regression equation of X 4 on X u X 2 , and X 3 can be written 




where a x , a 2 , and a 3 are determined by solving simultaneously the equations 
a l r n +a 2 r n + a 3 r l3 = r l4 
air 2l +a 2 r 22 + a 3 r 23 = r 24 
air 3l +a 2 r 32 + a 3 r 33 = r 34 

and where x, = X,- - X h ry = 1, and j = 1,2,3, and 4. Generalize to the case of more than four variables. 
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15.42 Given X x = 20, X 2 = 36, X 3 = 12, X 4 = 80, s l = 1.0, s 2 = 2.0, s 3 = 1.5, s 4 = 6.0, r l2 = -0.20, r u = 0.40, 
r 23 = 0.50, f 14 = 0.40, r 24 = 0.30, and r 34 = -0.10, (a) find the regression equation of X 4 on X u X 2 , and X 3 , 
and (b) estimate JJT 4 when X, = 15, X 2 = 40, and Z 3 = 14. 

15.43 Find (a) r 4]23l (b) r 42A3 , and (c) r 43 12 for the data of Problem 15.42 and interpret your results. 

15.44 For the data of Problem 15.42, find (a) R 4A23 and (b) s 4A2} . 

15.45 The amount spent on medical expenses per year is correlated with other health factors for 1 5 adult males. 
A study collected the medical expenses per year, Y, as well as information on the following 
independent variables, 

f 0, if a non-smoker , , , , 

11 if a smoker = money spent on alcohol per week, 

X 3 = hours spent exercising per week, 

{0, dietary knowledge is low 
1 , dietary knowledge is average 
2, dietary knowledge is high 
X 5 = weight X 6 = age 

The notation given in this problem will be encountered in many statistics books. Y will be used for the 
dependent variable and Xwith subscripts for the independent variables. Using the data in Table 15.6, find 
the regression equation of Y on X x through X 6 by solving the normal equations and contrast this with 
the EXCEL, MINITAB, SAS, SPSS, and STATISTIX solutions. 



Table 15.6 



Medcost 


Smoker 


Alcohol 


Exercise 


Dietary 


Weight 




2100 





20 


5 


1 


185 


50 


2378 


1 


25 





1 


200 


42 


1657 





10 


10 


2 


175 


37 


2584 


1 


20 


5 


2 


225 


54 


2658 


1 


25 





1 


220 


32 


1842 








10 


1 


165 


34 


2786 


1 


25 


5 





225 


30 


2178 





10 


10 


1 


180 


41 


3198 


1 


30 





1 


225 


31 


1782 





5 


10 





180 


45 


2399 





25 


12 


2 


225 


45 


2423 





15 


15 





220 


33 


3700 


1 


25 





1 


275 


43 


2892 


1 


30 


5 


1 


230 


42 


2350 


1 


30 


10 


1 


245 


40 




Analysis of Variance 



THE PURPOSE OF ANALYSIS OF VARIANCE 

In Chapter 8 we used sampling theory to test the significance of differences between two sampling 
means. We assumed that the two populations from which the samples were drawn had the same variance. 
In many situations there is a need to test the significance of differences between three or more sampling 
means or, equivalently, to test the null hypothesis that the sample means are all equal. 

EXAMPLE 1 . Suppose that in an agricultural experiment four different chemical treatments of soil produced mean 
wheat yields of 28, 22, 18, and 24 bushels per acre, respectively. Is there a significant difference in these means, or is 
the observed spread due simply to chance? 

Problems such as this can be solved by using an important technique known as analysis of variance, developed 
by Fisher. It makes use of the F distribution already considered in Chapter 1 1 . 



ONE-WAY CLASSIFICATION, OR ONE-FACTOR EXPERIMENTS 

In a one-factor experiment, measurements (or observations) are obtained for a independent groups of 
samples, where the number of measurements in each group is b. We speak of a treatments, each of which 
has b repetitions, or b replications. In Example 1, a — 4. 

The results of a one-factor experiment can be presented in a table having a rows and b columns, as 
shown in Table 16.1. Here X jk denotes the measurement in the j'th row and A;th column, where 
j = \,2,...,a and where k = \,2,...,b. For example, X 35 refers to the fifth measurement for the third 
treatment. 



Table 16.1 



Treatment 2 



X u , X n ,...,X ih 



We shall denote by Xj the mean of the measurements in the /th row. We have 
X^.tx, j = l,2,...,a 



Copyright © 2008, 1999, 1988, 1961 by The McGraw-Hill Companies, Inc. Click here for terms of use. 
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The dot in Xj is used to show that the index k has been summed out. The values Xj are called group 
means, treatment means, or row means. The grand mean, or overall mean, is the mean of all the measure- 
ments in all the groups and is denoted by X: 



TOTAL VARIATION, VARIATION WITHIN TREATMENTS, AND VARIATION 
BETWEEN TREATMENTS 

We define the total variation, denoted by V, as the sum of the squares of the deviations of each 
measurement from the grand mean X 

Total variation = V = ^{X jk - X) 2 (3) 

By writing the identity 

X jk -X= (X jk - Xj.) + (Xj. - X) (4) 
and then squaring and summing over j and k, we have (see Problem 16.1) 

Ypjk - x? = E^ - ^ 2 + - ( J ) 

jjc j,k jjc 

E( z i* - * ) 2 = E(^* - ^-) 2 + h E^-- - ( 6 ) 

We call the first summation on the right-hand side of equations (5) and (6) the variation within treatments 
(since it involves the squares of the deviations of X jk from the treatment means Xj) and denote it by V w . 
Thus 

v w = E(^> " X jf ( 7 ) 

The second summation on the right-hand side of equations (J) and (6) is called the variation between 
treatments (since it involves the squares of the deviations of the various treatment means X u from the 
grand mean X) and is denoted by V B . Thus 

V B = J2(X j .-X) 2 = b'£(Xj-X) 2 (8) 
Equations (J) and (6) can thus be written 



SHORTCUT METHODS FOR OBTAINING VARIATIONS 

To minimize the labor of computing the above variations, the following forms are convenient: 
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where T is the total of all values X jk and where 7} is the total of all values in the j'th treatment: 

In practice, it is convenient to subtract some fixed value from all the data in the table in order to simplify 
the calculation; this has no effect on the final results. 



MATHEMATICAL MODEL FOR ANALYSIS OF VARIANCE 

We can consider each row of Table 16.1 to be a random sample of size b from the population for that 
particular treatment. The X jk will differ from the population mean \Xj for the j'th treatment by a chance 
error, or random error, which we denote by e jk ; thus 

Xjk = My + £jk (14) 

These errors are assumed to be normally distributed with mean and variance a 2 . If \i is the mean of 
the population for all treatments and if we let a t = ^ - fi, so that /iy = \l + a j7 then equation (14) 
becomes 

X jk = n + aj + e jk (15) 

where J2j a j = (see Problem 16.9). From equation (15) and the assumption that the e jk are normally 
distributed with mean and variance a 2 , we conclude that the X jk can be considered random variables 
that are normally distributed with mean \i and variance a . 

The null hypothesis that all treatment means are equal is given by (H Q : aj = 0; j = 1, 2, . . . , a) or, 
equivalently, by (H : fij — (j,; j = 1, 2, . . . , a). If H is true, the treatment populations will all have the 
same normal distribution (i.e., with the same mean and variance). In such cases there is just one treat- 
ment population (i.e., all treatments are statistically identical); in other words, there is no significant 
difference between the treatments. 



EXPECTED VALUES OF THE VARIATIONS 

It can be shown (see Problem 16.10) that the expected values of V w , V B , and V are given by 

E(V w ) = a(b-\)o> (16) 

E(V B ) = (a-l)<7 2 + bJ2 a j ( 17 ) 

E(V) = (ab - l)a 2 + bJ2 a ] (M) 

From equation (16) it follows that 

» .ha, ^'Wh) <-*» 

is always a best (unbiased) estimate of tr 2 regardless of whether H is true. On the other hand, we see 
from equations (16) and (18) that only if H is true (i.e., a 7 = 0) will we have 
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so that only in such case will 

S\ = -^- and S 2 =-*— (22) 
a — l ab — 1 

provide unbiased estimates of a 2 . If H is not true, however, then from equation (16) we have 

E(sl)=* 2 +^- x Y. a J W 



DISTRIBUTIONS OF THE VARIATIONS 

Using the additive property of chi-square, we can prove the following fundamental theorems 
concerning the distributions of the variations V w , V B , and V: 

Theorem 1: V w /a 2 is chi-square distributed with a(b - 1) degrees of freedom. 

Theorem 2: Under the null hypothesis H , V B /a 2 and V fa 2 are chi-square distributed 
with a — 1 and ab — 1 degrees of freedom, respectively. 

It is important to emphasize that Theorem 1 is valid whether or not H is assumed, whereas Theorem 2 is 
valid only if H is assumed. 



THE F TEST FOR THE NULL HYPOTHESIS OF EQUAL MEANS 

If the null hypothesis H Q is not true (i.e., if the treatment means are not equal), we see from equation 
(23) that we can expect S B to be greater than a 2 , with the effect becoming more pronounced as the 
discrepancy between the means increases. On the other hand, from equations (19) and (20) we can expect 
§w to be equal to a 2 regardless of whether the means are equal. It follows that a good statistic for testing 
hypothesis H is provided by S 2 B /S 2 W . If this statistic is significantly large, we can conclude that there is a 
significant difference between the treatment means and can thus reject H ; otherwise, we can either 
accept H or reserve judgment, pending further analysis. 

In order to use the S B /S%r statistic, we must know its sampling distribution. This is provided by 
Theorem 3. 

Theorem 3: The statistic F — S 2 B /S 2 W has the F distribution with a - 1 and a(b — 1) 
degrees of freedom. 

Theorem 3 enables us to test the null hypothesis at some specified significance level by using a one-tailed 
test of the T 7 distribution (discussed in Chapter 11). 



ANALYSIS-OF- VARIANCE TABLES 

The calculations required for the above test are summarized in Table 16.2, which is called an 
aiuilysis-of-variance table. In practice, we would compute V and V B by using either the long method 
[equations (3) and (8)] or the short method [equations (70) and (77)] and then by computing 
V w = V - V B . It should be noted that the degrees of freedom for the total variation (i.e., ab - 1) 
are equal to the sum of the degrees of freedom for the between-treatments and within-treatments 
variations. 



CHAP. 16] ANALYSIS OF VARIANCE 407 



Table 16.2 



Variation 


Degrees of Freedom 


Mean Square 


F 


Between treatments, 


a- 1 




Si 
Sjy 

with a - 1 and a(b - 1) 
degrees of freedom 


Within treatments, 
V w = V- V B 


a(b-l) 


sl- Vw 

Sw -a{b-\) 


Total, 

v=v B +v w 

= ^(^-JT) 2 


ab-\ 





MODIFICATIONS FOR UNEQUAL NUMBERS OF OBSERVATIONS 

In case the treatments I,..., a have different numbers of observations — equal to N { ,...,N a , 
respectively — the above results are easily modified, Thus we obtain 

F^(i Jt -i) 2 = x;4-| (24) 

j,k j j i 

V W =V-V B (26) 

where J2j k denotes the summation over k from 1 to Nj and then the summation over j from 1 to a. 
Table 16.3 is the analysis-of-variance table for this case. 



Table 16.3 



Variation 


Degrees of Freedom 


Mean Square 


F 


Between treatments 
V B = Y, Nj(Xj. - Xf 


a- 1 




s\ 

S 2 w 

with a - 1 and N-a 
degrees of freedom 


Within treatments, 
V w = V- V B 


N-a 




Total, 

v= V B + v w 
= Y^ ]k -xf 


N — 1 





TWO-WAY CLASSIFICATION, OR TWO-FACTOR EXPERIMENTS 



The ideas of analysis of variance for one-way classification, or one-factor experiments, can be 
generalized. Example 2 illustrates the procedure for two-way classification, or two-factor experiments. 
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EXAMPLE 2. Suppose that an agricultural experiment consists of examining the yields per acre of 4 different 
varieties of wheat, where each variety is grown on 5 different plots of land. Thus a total of (4) (5) = 20 plots are 
needed. It is convenient in such case to combine the plots into blocks, say 4 plots to a block, with a different variety 
of wheat grown on each plot within a block. Thus 5 blocks would be required here. 

In this case there are two classifications, or factors, since there may be differences in yield per acre due to (1) the 
particular type of wheat grown or (2) the particular block used (which may involve different soil fertility, etc.). 

By analogy with the agricultural experiment of Example 2, we often refer to the two factors in 
an experiment as treatments and blocks, but of course we could simply refer to them as factor 1 and 
factor 2. 



NOTATION FOR TWO-FACTOR EXPERIMENTS 

Assuming that we have a treatments and b blocks, we construct Table 16.4, where it is supposed that 
there is one experimental value (such as yield per acre) corresponding to each treatment and block. For 
treatment j and block k, we denote this value by X jk . The mean of the entries in the j'th row is denoted by 
Xj, where j= I,..., a, while the mean of the entries in the kth column is denoted by X k , where 
k = 1, . . . , b. The overall, or grand, mean is denoted by X. In symbols, 

k=\ j=\ j.k 



Treatment 1 
Treatment 2 



VARIATIONS FOR TWO-FACTOR EXPERIMENTS 

As in the case of one-factor experiments, we can define variations for two-factor experiments. 
We first define the total variation, as in equation (3), to be 



By writing the identity 

X jk -X= (X jk - Xj, - X k + X) + (Xj, -X) + (X k - X) (29) 
and then squaring and summing over j and k, we can show that 

V=V E +V R +V C (30) 
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where V E = variation due to error or chance = y^(Z /t - 1} - 2^ + JT) 2 

/,* 

= variation between rows (treatments) = b^(Xj - X) 2 

7=1 
b 

V c — variation between columns (blocks) = a^(X k - X) 2 
k=\ 

The variation due to error or chance is also known as the residual variation or random variation. 

The following, analogous to equations (10), (11), and (12), are shortcut formulas for computation: 




V E =V-V R -V C (34) 



where 7} is the total of entries in the j'th row, T k is the total of entries in the kth column, and T is the 
total of all entries. 



ANALYSIS OF VARIANCE FOR TWO-FACTOR EXPERIMENTS 

The generalization of the mathematical model for one-factor experiments given by equation (75) 
leads us to assume for two-factor experiments that 

Xj k = li + aj + k + e jk (35) 

where J] ctj — and ^2 Pk — 0- Here \i is the population grand mean, ctj is that part of X jk due to the 
different treatments (sometimes called the treatment effects), f3 k is that part of X jk due to the different 
blocks (sometimes called the block effects), and e jk is that part of X jk due to chance or error. As before, 
we assume that the e jk are normally distributed with mean and variance a 2 , so that the X jk are also 
normally distributed with mean /x, and variance a 2 . 

Corresponding to results (16), (17), and (18), we can prove that the expectations of the variations are 



given by 

E(V E ) = (a-l)(b-l)a 2 (36) 

E(V R )^(a-\)a 2 + bY J ^ (37) 

j 

E(V c ) = (b-\y + aY,Pk (38) 

k 

E(V) = (ab-\)a 2 + b^a 2 + a^(3 2 k (39) 

j k 

There are two null hypotheses that we would want to test: 

Hq * 1 : All treatment (row) means are equal; that is, ctj — 0, and j — 1, . . . , a. 
H^: All block (column) means are equal; that is, (3 k = 0, and k= \,...,b. 
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We see from equation (38) that, without regard to or Flf\ a best (unbiased) estimate of a 2 is 
provided by 

that is, E(S 2 E ) = a 2 (40) 



* (a-l)(b-l) 
Also, if hypotheses and are true, then 

will be unbiased estimates of ct 2 . If and //„ 2) are not true, however, then from equations (36) and 
(37), respectively, we have 

E(S 2 R ) = a 2 + ^- x Y. a J i 42 ) 

The following theorems are similar to Theorems 1 and 2: 

Theorem 4: V E /<r 2 is chi-square distributed with (a — 1) (b — 1) degrees of freedom, 
without regard to Z/^ 1 ' or . 

Theorem 5: Under hypothesis Hq\ V R ja 2 is chi-square distributed with a - 1 
degrees of freedom. Under hypothesis H^, V c ja 2 is chi-square distributed with 
b - 1 degrees of freedom. Under both hypotheses, and V jo 2 is chi-square 

distributed with ab — 1 degrees of freedom. 

To test hypothesis it is natural to consider the statistic S R /S E since we can see from equation 
(42) that S 2 R is expected to differ significantly from a 2 if the row (treatment) means are significantly 
different. Similarly, to test hypothesis we consider the statistic S 2 C /S 2 E . The distributions of S 2 R /S 2 E 
and Sc/S E are given in Theorem 6, which is analogous to Theorem 3. 

Theorem 6: Under hypothesis the statistic S 2 R /S 2 E has the F distribution with 
a- I and (a-\)(b-\) degrees of freedom. Under hypothesis the statistic 

Sc/S 2 ? has the F distribution with b - 1 and (a - \)(b - 1) degrees of freedom. 

Theorem 6 enables us to accept or reject Fl\^ or at specified significance levels. For convenience, as 
in the one-factor case, an analysis-of-variance table can be constructed as shown in Table 16.5. 



TWO-FACTOR EXPERIMENTS WITH REPLICATION 

In Table 16.4 there is only one entry corresponding to a given treatment and a given block. More 
information regarding the factors can often be obtained by repeating the experiment, a process called 
replication. In such case there will be more than one entry corresponding to a given treatment and a given 
block. We shall suppose that there are c entries for every position; appropriate changes can be made 
when the replication numbers are not all equal. 

Because of replication, an appropriate model must be used to replace that given by equation (35). 
We use 

Xjki = M + olj + P k + ljk + Eju (44) 

where the subscripts j, k, and / of X jk , correspond to the /th row (or treatment), the /rth column (or 
block), and the /th repetition (or replication), respectively. In equation (44) the y,, a j7 and (3 k are defined 



Table 16.5 



Variation 


Degrees of Freedom 


Mean Square 


F 


Between treatments, 
V R = bJ2 {Xj. - x? 


a- 1 


* = ^1 


with a - 1 and (a - 1)0- 1) 
degrees of freedom 


Between blocks, 
V c = aJ2(X.k-X) 2 


b- 1 


ic_ fc-l 


with ft - 1 and (a - 1)0- 1) 
degrees of freedom 


Residual or random, 


(a -1)0-1) 


c2 _ ^ 




v E = v-v R -v c 


(a -1)^-1) 




Total, 

V = V R + V c + V E 


ah - 1 







as before; e jkl is a chance or error term, while the j jk denote the row-column (or treatment-block) 
interaction effects, often simply called interactions. We have the restrictions 

^> 7 . = £& = = £^ = ° (* 5 ) 

and the A^y are assumed to be normally distributed with mean fi and variance a 2 . 

As before, the total variation V of all the data can be broken up into variations due to rows V R , 
columns V c , interaction V u and random or residual error V E : 



- (Xjkl - x) 2 
j,k,l 

= bcY j {X j .-Xf 

7=1 

-acf^(X, k -Xf 

= cJ2(Xj k .-Xj..-X k +X) 2 
j.k 



In these results the dots in the subscripts have meanings analogous to those given before; thus, for 
example, 

The expected values of the variations can be found as before. Using the appropriate number of degrees of 
freedom for each source of variation, we can set up the analysis-of-variance table as shown in 
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Table 16.6. The F ratios in the last column of Table 16.6 can be used to test the null hypotheses: 
All treatment (row) means are equal; that is, otj — 0. 
H^: All block (column) means are equal; that is, (3 k = 0. 

There are no interactions between treatments and blocks; that is, j jk = 0. 



Table 16.6 



Variation 


Degrees of Freedom 


Mean Square 


F 


V R 


a- 1 




S R /S E 
with a - 1 and ab(c - 1) 
degrees of freedom 


Between blocks, 

Vc 


b- 1 


Sc -b^\ 


Sc/Se 
with b- 1 and ab(c - 1) 
degrees of freedom 


Interaction, 
Vj 


(«-!)(*-!) 


' («-!)(*- 1) 


si/s 2 E 

with {a - l)(b- 1) and 
ab(c - 1) degrees of freedom 


Residual or random, 

V E 


ab(c-\) 


s 2 - Ve 

E ab(c-\) 




Total, 
V 


abc - 1 







From a practical point of view we should first decide whether or not Hq can be rejected at an 
appropriate level of significance by using the F ratio S]/Se of Table 16.6. Two possible cases then arise: 

1 . Z/^ 3 ' Cannot Be Rejected. In this case we can conclude that the interactions are not too large. We can 
then test and by using the F ratios S\lS 2 E and Sc/S 2 ?, respectively, as shown in Table 16.6. 
Some statisticians recommend pooling the variations in this case by taking the total of V I + V E and 
dividing it by the total corresponding degrees of freedom (a — \){b — 1) + ab(c — 1) and using this 
value to replace the denominator S 2 E in the F test. 

2. Z/^ 3 ' Can Be Rejected. In this case we can conclude that the interactions are significantly large. 
Differences in factors would then be of importance only if they were large compared with such inter- 
actions. For this reason, many statisticians recommend that and Hf" 1 be tested by using the Fratios 
S R /Sj and Sc/Sj rather than those given in Table 16.6. We, too, shall use this alternative procedure. 

The analysis of variance with replication is most easily performed by first totaling the replication 
values that correspond to particular treatments (rows) and blocks (columns). This produces a two-factor 
table with single entries, which can be analyzed as in Table 16.5. This procedure is illustrated in 
Problem 16.16. 



EXPERIMENTAL DESIGN 

The techniques of analysis of variance discussed above are employed after the results of an experi- 
ment have been obtained. However, in order to gain as much information as possible, the design of an 
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experiment must be planned carefully in advance; this is often referred to as the design of the experiment. 
The following are some important examples of experimental design: 

1. Complete Randomization. Suppose that we have an agricultural experiment as in Example 1. To 
design such an experiment, we could divide the land into 4 x 4 = 16 plots (indicated in Fig. 16-1 by 
squares, although physically any shape can be used) and assign each treatment (indicated by A, B, C, 
and D) to four blocks chosen completely at random. The purpose of the randomization is to 
eliminate various sources of error, such as soil fertility. 



Fig. 16-1 Complete randomization. 



C D 
D C 



Fig. 16-2 Randomized blocks. 



Fig. 16-4 Graeco-Latin square. 



Fig. 16-3 Latin square. 



2. Randomized Blocks. When, as in Example 2, it is necessary to have a complete set of treatments for 
each block, the treatments A, B, C, and D are introduced in random order within each block: I, II, 
III, and IV (i.e., the rows in Fig. 16-2), and for this reason the blocks are referred to as randomized 
blocks. This type of design is used when it is desired to control one source of error or variability: 
namely, the difference in blocks. 

3. Latin Squares. For some purposes it is necessary to control two sources of error or variability at the 
same time, such as the difference in rows and the difference in columns. In the experiment of 
Example 1 , for instance, errors in different rows and columns could be due to changes in soil fertility 
in different parts of the land. In such case it is desirable that each treatment occur once in each row 
and once in each column, as in Fig. 16-3. The arrangement is called a Latin square from the fact that 
the Latin letters A, B, C, and D are used. 

4. Graeco-Latin Squares. If it is necessary to control three sources of error or variability, a. Graeco-Latin 
square is used, as shown in Fig. 16-4. Such a square is essentially two Latin squares superimposed on 
each other, with the Latin letters A, B, C, and D used for one square and the Greek letters a, j3, 7, 
and 6 used for the other square. The additional requirement that must be met is that each Latin letter 
must be used once and only once with each Greek letter; when this requirement is met, the square is 
said to be orthogonal. 
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Solved Problems 

ONE-WAY CLASSIFICATION, OR ONE-FACTOR EXPERIMENTS 

16.1 Prove that V = V w + V B ; that is, 

£ (x jk -xf = Y. (** - x jf + £ ft. - 

/A y',fc y,fc 

SOLUTION 

We have X jk - X = (Jfy - + - X) 

Then, squaring and summing over j and £, we obtain 

£ (Xjk -x? = Y. - *jf + £ - x ? + 2 E - - *) 

J,k j,k j,k j,k 

To prove the required result, we must show that the last summation is zero. In order to do this, we proceed 
as follows: 

£ (x jk - Xj){Xj. -*) = £ ( x j. - *) - 

=E(^-^[(e^)-^] =° 

since X,. = ^ £ */* 

16.2 Verify that (a) T = abX, (b) 7}. = M}., and (c) ^ 7}. = abX, using the notation on page 362. 
SOLUTION 

(a) T = Yx }k = ab[^ b YXj)j=abX 

(*) r y- = £ = * £ = bX l 

(c) Since 7}. = J2k x jk> b Y P art («) we have 

£*}. = ££** = r = afc* 

16.3 Verify the shortcut formulas (70), (77), and (72) of this chapter. 
SOLUTION 

We have V = £ (X jk - X) 1 = £ ( x jk - 2XX jk + X 1 ) 

= E x l - 2X E x ik + ahxl 

= £ x fk ~ 2X(abX) + abX 2 
= Y x fk-abX 2 
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using Problem 16.2(a) in the third and last lines above. Simikirh. 

v b = E - = E - 2£ *>- + x 2 ) 

j,k j,k 
j,k j,k 

J,k v 7 ./,* 

= ^ E E ^ 2 - 2 ^( a ^) + 
= JE^ 2 -^ 2 

_ 1 2 T 2 

using Problem 16.2(6) in the third line and Problem 16.2(a) in the last line. Finally, equation (12) follows 
from that fact that V = V w + V B , or V w = V - V B . 

16.4 Table 16.7 shows the yields in bushels per acre of a certain variety of wheat grown in a particular 
type of soil treated with chemicals A, B, or C. Find (a) the mean yields for the different treat- 
ments, (b) the grand mean for all treatments, (c) the total variation, (d) the variation between 
treatments, and (e) the variation within treatments. Use the long method. (/) Give the EXCEL 
analysis for the data shown in Table 16.7. 



To simplify the arithmetic, we may subtract some suitable number, say 45, from all the data without 
affecting the values of the variations. We then obtain the data of Table 16.8. 

(a) The treatment (row) means for Table 16.8 are given, respectively, by 

X t =1(3 + 4+5 + 4) =4 X 2 . =i(2 + 4 + 3 + 3) = 3 Z 3 . = \ (4 + 6 + 5 + 5) = 5 

Thus the mean yields, obtained by adding 45 to these, are 49, 48, and 50 bushels per acre for A, B, and 
C, respectively. 

(b) The grand mean for all treatments is 

X = ^ (3 + 4+ 5 + 4 + 2 + 4 + 3 + 3 + 4 + 6 + 5 + 5) = 4 

Thus the grand mean for the original set of data is 45 + 4 = 49 bushels per acre. 

(c) The total variation is 

F =E ^ - x ? = ( 3 - 4 ) 2 + ( 4 - 4 ) 2 + ( 5 - 4 ) 2 + ( 4 - 4 ) 2 + ( 2 - 4 ) 2 + ( 4 - 4 ) 2 

+ (3 - 4) 2 + (3 - 4) 2 + (4 - 4) 2 + (6 - 4) 2 + (5 - 4) 2 + (5 - 4) 2 = 14 
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The variation between treatments is 

V B = bJ2 (Xj. - X? = 4[(4 - 4) 2 + (3 - 4) 2 + (5 - 4) 2 ] = 8 
The variation within treatments is 

V w = V - V B = 14 - 8 = 6 

Another method 

v w = E ( X * - X )? = ( 3 " 4 ) 2 + ( 4 - 4 ) 2 + ( 5 - 4 ) 2 + ( 4 - 4 ) 2 + ( 2 - 3 ) 2 + ( 4 - 
+ (3 - 3) 2 + (3 - 3) 2 + (4 - 5) 2 + (6 - 5) 2 + (5 - 5) 2 + (5 - 5) 2 = 6 
Note: Table 16.9 is the analysis-of-variance table for Problems 16.4, 16.5, and 16.6. 



Table 16.9 



Variation 


Degree of Freedom 


Mean Square 


F 


Between treatments, 
V B = 8 




« = ? = 4 


S% 4 

^ = V3 = 6 

with 2 and 9 
degrees of freedom 


Within treatments, 
V w = V - V B 
= 14-8 = 6 


fl (fc-l) = (3)(3)=9 




Total, 
V = 14 


oft - 1 = (3)(4) - 1 
= 11 





The EXCEL pull-down Tools — ► Data analysis — > Anova single factor gives the below analysis. 
The />-value tells us that the means for the three varieties are different at a = 0.05. 

ABC 

48 47 49 

49 49 51 

50 48 50 
49 48 50 

Anova: Single Factor 
SUMMARY 



Groups Count 


Sum 


Average 


Variance 




A 


4 


196 


49 


0.666667 




B 


4 


192 


48 


0.666667 




C 


4 


200 


50 


0.666667 




ANOVA 












Source of 












Variation 




SS 


df 


MS 


F P-value 


Between Groups 


8 


2 


4 


6 0.022085 


Within Groups 


6 


9 0. 


366667 




Total 




14 


11 
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Figure 16-5 shows a MINITAB dot plot for the yields of the three varieties of wheat. Figure 16-6 
shows a MINITAB box plot for the yields of the three varieties of wheat. The EXCEL analysis and the 
MINITAB plots tell us that variety C significantly out yields variety B. 



Dotplot of yields vs treatment 



Yields 

Fig. 16-5 A MINITAB dot plot for the yields of the three 



Boxplot of yield vs variety 




Variety 

Fig. 16-6 A MINITAB box plot for the yields of the three 



Referring to Problem 16.4, find an unbiased estimate of the population variance a from (a) the 
variation between treatments under the null hypothesis of equal treatment means and (b) the 
variation within treatments, (c) Refer to the MINITAB output given in the solution to Problem 



16.4 and locate the 



estimates computed in parts (a) and (b). 



(b) 

(c) The variance estimate Sj 
Factor MS = 4.000 

The vari; 
Error MS = 4.000 



the s; 



e as Si. 



a(b-\) 3(4-1) 3 
s the Factor mean square 



the MINITAB output. That is 
square in the MINITAB output. That is 



Referring to Problem 16.4, can we reject the null hypothesis of equal means at significance levels 
of (a) 0.05 and (b) 0.01? (c) Refer to the MINITAB output given in the solution to Problem 16.4 
to test the null hypothesis of equal means. 
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SOLUTION 

We have 



vith a - 



= 3- 



S 2 W 2/3 

1 = 2 degrees of freedom and a(b - 1) = 3(4 - 1) = 9 degrees of freedom. 



(a) Referring to Appendix V, with u l = 2 and u 2 = 9, we see that F 95 = 4.26. Since F = 6 > F 95 , we can 
reject the null hypothesis of equal means at the 0.05 level. 

(b) Referring to Appendix VI, with v x =2 and v 2 = 9, we see that F M = 8.02. Since F = 6 < F M , we 
cannot reject the null hypothesis of equal means at the 0.01 level. 

(c) By referring to the M1NITAB output found in Problem 16.4, we find that the computed value of F is 
6.00 and the /j-value is equal to 0.022. Thus the smallest pre-set level of significance at which the null 
hypothesis would be rejected is 0.022. Therefore, the null hypothesis would be rejected at 0.05, but not 
at 0.01. 

16.7 Use the shortcut formulas (70), (11), and (12) to obtain the results of Problem 16.4. In addition, 
use M1NITAB on the data, from which 45 has been subtracted from each value, to obtain the 
analysis-of-variance-table. 

SOLUTION 

It is convenient to arrange the data as in Table 16.10. 



4 5 4 
4 3 3 



E4 = 206 



T = J2 r ; = 48 Yl T ? : = 8 



(a) Using formula (70), we have 

J2 x fk = 9 + 16 + 25 + 16 + 4 + 16 + 9 + 9 + 16 + 36 + 25 + 25 = 206 



and 



T= 3+4+ 5+4+2+4+3+3+4+6+5+5= 



(48) / 



: 206 - 192 = 14 



"-54-5— (3)(4) 

(b) The totals of the rows are 

r,. = 3 + 4 + 5 + 4= 16 T 2 , =2 + 4 + 3 + 3= 12 T 3 , = 4 + 6 + 5 + 5 = 20 



and 

Thus, using formula (77), we have 



T= 16+ 12 + 20 = 4 



K fl = i E ^-g = i(16 2 +12 2 + 20V^^00-192 = 8 



(c) Using formula (72), we have 



V w = F-F 5 =14-8 = 6 



CHAP. 16] 



ANALYSIS OF VARIANCE 



419 



The results agree with those obtained in Problem 16.4, and from this point on the analysis proceeds as 

The MIN1TAB pull-down Stat — > Anova — ► Oneway gives the following output. Note the differences in 
terminology used. The variation within treatments is called Within Groups in the EXCEL output and 
Error in the MINITAB output. The variation between treatments is called Between Groups in the 
EXCEL output and Factor in the MINITAB output. You have to get used to the different terminology 
used by the various software packages. 

One-way ANOVA: A, B, C 

Source DF SS MS F P 

Factor 2 8.000 
Error 9 6.00 

Total 11 14.000 

S=0.8165 R-Sq=57.14% R-Sq( adj ) =47 . 62% 



16.8 A company wishes to purchase one of five different machines: A, B, C, D, or E. In an experiment 
designed to test whether there is a difference in the machines' performance, each of five experi- 
enced operators works on each of the machines for equal times. Table 16.1 1 shows the numbers of 
units produced per machine. Test the hypothesis that there is no difference between the machines 
at significance levels of (a) 0.05 and (b) 0.01. (c) Give the STATIST1X solution to the problem and 
test the hypothesis that there is no difference between the machines using the /rvalue approach. 
Use a = 0.05. 



68 72 77 42 53 
72 53 63 53 48 



60 82 64 75 72 



48 61 57 64 



64 65 70 68 53 





Tj. 


Tj. 


A 


8 12 17 -18 -7 


12 


144 


B 


12 -7 3 -7 -12 


-11 


121 


C 


22 4 15 12 


53 


2809 


D 


-12 1 -3 4 -10 


-20 


400 


E 


4 5 10 8 -7 


20 


400 




e4 = 2658 


54 


3874 



SOLUTION 

Subtract a suitable number, say 60, from all the data to obtain Table 16.12. Then 



K. = ^-S=™- 116.64 = 658.16 
5 (5)(4) 



We now form Table 16.13. For 4 and 20 degrees of freedom, we have F 95 = 2. 87. Thus we cannot reject 
the null hypothesis at the 0.05 level and therefore certainly cannot reject it at the 0.01 level. 

The STATISTIX pull-down Statistics — ► One, two, multi-sample tests — > One-way Anova gives the 
following output. 
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Statistix 8 . 

One-way AOV for: A B C D E 



Source DF SS MS F P 

Between 4 658.16 164.540 1.75 0.1792 

Within 20 1883.20 94.160 

Total 24 2541.36 

Grand Mean 62 . 160 CV 15.61 

Variable Mean 

A 62.400 

B 57.800 

C 70.600 

D 56.000 

E 64.000 



The p-value is 0.1792. There are no significant differences in the population means. 



Table 16.13 



Variation 


Degrees of Freedom 


Mean square 


F 


Between treatments 
V B = 658.2 


a- 1 =4 


SJ-^- 164.5 


164 1 5 5 = 
94.16 


Within treatments 
V w = 1883.2 


fl (ft-l) = (5)(4)=20 






Total 
V = 2514.4 


ab - 1 = 24 







MODIFICATIONS FOR UNEQUAL NUMBERS OF OBSERVATIONS 

16.9 Table 16.14 shows the lifetimes in hours of samples from three different types of television tubes 
manufactured by a company. Using the long method, determine whether there is a difference 
between the three types at significance levels of (a) 0.05 and (b) 0.01. 



Table 16.14 



Sample 1 


407 411 409 


Sample 2 


404 406 408 405 402 


Sample 3 


410 408 406 408 



SOLUTION 

It is convenient to subtract a suitable number from the data, say 400, obtaining Table 16.15. This table 
shows the row totals, the sample (or group) means, and the grand mean. Thus we have 
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V = E {Xjk - X? = (7 - i f + (11 - if + ■ ■ ■ + (8 - i f = 12 

v B = E - = E W - = 3 ( 9 - 7 ) 2 + 5 ( 7 - 5 ) 2 + 4 ( 8 - 7 ) 2 = 36 

j,k j 

V w = V - V B = 72 - 36 = 36 
We can also obtain V w directly by observing that it is equal to 

(7 - 9) 2 + (1 1 - 9) 2 + (9 - 9) 2 + (4 - 5) 2 + (6 - 5) 2 + (8 - 5) 2 + (5 - 5) 2 
+ (2 - 5) 2 + (10 - 8) 2 + (8 - 8) 2 + (6 - 8) 2 + (8 - 8) 2 



Table 16.15 





Total 




Sample 1 


7 119 


27 


9 


Sample 2 


4 6 8 5 2 


25 


5 


Sample 3 


10 8 6 8 


32 


8 




X = grand mean = j| = 7 



The data can be summarized as in Table 16.16, the analysis-of-variance table. Now for 2 and 9 degrees 
of freedom, we find from Appendix V that F 95 = 4.26 and from Appendix VI that F 99 = 8.02. Thus we can 
reject the hypothesis of equal means (i.e., no difference between the three types of tubes) at the 0.05 level but 
not at the 0.01 level. 



Table 16.16 



Variation 


Degree of Freedom 


Mean Square 


F 


Kb = 36 


a-l=2 




S 2 B 18 
S 2 w 4 
= 4.5 


IV = 36 


N-a = 9 





16.10 Work Problem 16.9 by using the shortcut formulas included in equations (24), (25), and (26). 
In addition, give the SAS solution to the problem. 

SOLUTION 

From Table 16.15 we have V, = 3, N 2 = 5, N 3 = 4, N = 12, 7\ = 27, T 2 . = 25, T y = 32, and T = 84. 
Thus we have 

I/ = E^-^ = 72 + ll2 + --- + 62 + 82 -n^ = 72 

^Tj, T 2 (2lf (25f (32) 2 (84) 2 

v B = 2 r w j -^ = ^ + ^ + —4 rT = 36 

V W =V-V B = 36 
Using these, the analysis of variance then proceeds as in Problem 16.9. 
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The SAS pull-down Statistics — > ANOVA — > Oneway ANOVA gives the following output. 



The ANOVA Procedure 



3 Level Inf( 



Number of Observations Read 
Number of Observations Used 
The ANOVA Procedure 



Dependent Variable : lifetii 
Model 

Corrected Total 



Sum of Squares 
36.00000000 
36.00000000 
72 .00000000 
R- Square 
. 500000 
Anova SS 
36.00000000 



Mean Square 
18 .00000000 
4 . 00000000 

coef f Var 
0.491400 
Mean Square 
18 . 00000000 



4.50 



Root MSE 
2 .000000 
F Value 
4.50 



' . 0000 



Source DF 
Sample_ 2 

Note that SAS calls the variation between treatments model and variation within treatments it calls 
error. The test statistic is called F value and equals 4.50. The /rvalue is represented Pr>F and this 
equals 0.0442. At a = 0.05 the lifetimes would be declared unequal. 



TWO-WAY CLASSIFICATION, OR TWO-FACTOR EXPERIMENTS 

16.11 Table 16. 17 shows the yields per acre of four different plant crops grown on lots treated with three 
different types of fertilizer. Using the long method, determine at the 0.01 significance level 
whether there is a difference in yield per acre (a) due to the fertilizers and (b) due to the crops, 
(c) Give the MINITAB solution to this two-factor experiment. 

SOLUTION 

Compute the row totals, the row means, the column totals, the column means, the grand total, and the 
grand mean, as shown in Table 16.18. From this table we obtain: 

The variation of row means from the grand mean is 

V R = 4[(6.2 - 6.8) 2 + (8.3 - 6.8) 2 + (5.9 - 6.8) 2 ] = 13.68 

The variation of column means from the grand mean is 

V c = 3[(6.4 - 6.8) 2 + (7.0 - 6.8) 2 + (7.5 - 6.8) 2 + (6.3 - 6.8) 2 ] = 2.82 



Table 16.17 





Crop I 


Crop II 


Crop III 


Crop IV 


Fertilizer A 


4.5 


6.4 


7.2 


6.7 


Fertilizer B 


8.8 


7.8 


9.6 


7.0 


Fertilizer C 


5.9 


6.8 


5.7 


5.2 



Table 16.18 





Crop I 


Crop II 


Crop III 


Crop IV 


Total 




Fertilizer A 


4.5 


6.4 


7.2 


6.7 


24.8 


6.2 


Fertilizer B 


8.8 


7.8 


9.6 


7.0 


33.2 


8.3 


Fertilizer C 


5.9 


6.8 


5.7 


5.2 


23.6 


5.9 


Column total 


19.2 


21.0 


22.5 


18.9 


Grand total = 81.6 


Column mean 








6.3 


Grand mean = 6.8 


The total variation 
V = 


4.5 -6.8) 2 


+ (6.4-6.8) 


2 + (7.2 -6. 


8) 2 + (6.7-6.8) 2 






+ (8.8-6.8 


) 2 + (7.8-6 


8) 2 + (9.6- 


6.8) 2 + (7.0 


-6.8) 2 






+ (5.9-6.8 


) 2 + (6.8-6 


8) 2 + (5.7- 


6.8) 2 + (5.2 


-6.8) 2 = 23. 


)8 



The random variation is 

V E =V- V R — V c = 6.58 
This leads to the analysis of variance in Table 16.19. 



Table 16.19 



Variation 


Degree of 
Freedom 


Mean Square 


F 


V R = 13.68 


2 


S 2 R = 6.84 


S\/S 2 E = 6.24 
with 2 and 6 
degrees of freedom 


V c = 2.82 


3 


S 2 C = 0.94 


S 2 C /S 2 E = 0.86 
with 3 and 6 
degrees of freedom 


V E = 6.58 


6 


Si = 1.097 




V = 23.08 


11 







At the 0.05 significance level with 2 and 6 degrees of freedom, F 95 = 5.14. Then, since 6.24 > 5.14, we 
can reject the hypothesis that the row means are equal and conclude that at the 0.05 level there is a significant 
difference in yield due to the fertilizers. 

Since the F value corresponding to the differences in column means is less than 1, we can conclude that 
there is no significant difference in yield due to the crops. 

(c) The data structure for the MIN1TAB worksheet is given first, followed by the MINITAB analysis of the 
two-factor experiment. 
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MTB > Twoway 'Yield' 'Crop' 
SUBC > Means 'Crop' 'Fertil: 



Two-way Analysis of Varia 

Analysis of Variance f 



Crop 3 2.82 

Fertiliz 2 13.68 



Total 11 



Mean 
6.40 
7.00 
7.50 
6.30 



6.58 



•idual 95% CI 



6.00 
'idual 95% CI 



The data structure in the worksheet must correspond exactly to the data as given in Table 16.17. The 
first row, 114.5, corresponds to Crop 1, Fertilizer 1, and Yield 4.5, the second row, 1 2 8.8, corresponds 
to Crop 1, Fertilizer 2, and Yield 8.8, etc. A mistake that is often made in using statistical software is to set 
up the data structure in the worksheet incorrectly. Make certain that the data given in a table like Table 
16.17 and the data structure in the worksheet correspond in a one-to-one manner. Note that the two-way 
analysis of variance table given in the MINITAB output contains the same information given in Table 16.19. 
The p-values given in the MINITAB output allow the researcher to test the hypothesis of interest without 
consulting tables of the F distribution to find critical values. The p-value for crops is 0.512. This is the 
minimum level of significance for which we could reject a difference in mean yield for crops. The mean yields 
for the four crops are not statistically significant at 0.05 or 0.01. The p-value for fertilizers is 0.034. This tells 
us that the mean yields for the three fertilizers are statistically different at 0.05 but not statistically different 
at 0.01. 

The confidence intervals for the means of the four crops shown in the MINITAB output reinforce 
our conclusion of no difference in mean yields for the four different crops. The confidence intervals for 
the three fertilizers indicates that it is likely that Fertilizer B produces higher mean yields than either 
Fertilizer A or C. 
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16.12 Use the short computational formulas to obtain the results of Problem 16.11. 
In addition, give the SPSS solution to Problem 16.11. 

SOLUTION 

From Table 16.18 we have 

J2 x fk = (4-5) 2 + (6.4) 2 + • • • + (5.2) 2 = 577.96 

T = 24.8 + 33.2 + 23.6 = 81.6 
Y, T l = ( 24 - 8 ) 2 + ( 33 - 2 ) 2 + ( 23 - 6 ) 2 = 2274.24 
J2 T l = (19-2) 2 + (21.0) 2 + (22.5) 2 + (18.9) 2 = 1673.10 

Then V = J2 x jk - ^ = 577 - 96 - 554 - 88 = 23.08 

j,k a 

V k = \H T 1^ 1 ^, = \ (2274.24) - 554.88 = 13.68 
V c = - a Y. r.l-^ = ^ (1673.10) - 554.88 = 2.82 
V E = V — V R — V c = 23.08 - 13.68 - 2.82 = 6.58 

in agreement with Problem 16.11. 

The SPSS pull-down Analyze — » General Linear Model — > Univariate gives the following output. 
Tests of Between-Subjects Effects 



Dependent Variable: Yield 



Source 


Type 1 Sum 
of Squares 


df 


Mean Square 


F 


Sig. 


Corrected Model 


16.500 3 


5 


3.300 


3.009 


.106 


Intercept 


554.880 


1 


554.880 


505.970 


.000 


Crop 


2.820 


3 


.940 


.857 


.512 


Fert 


13.680 


2 


6.840 


6.237 


.034 


Error 


6.580 


6 


1.097 






Total 


577.960 


12 








Corrected Total 


23.080 


11 









a R Squared = .715 (Adjusted R Squared = .477) 



Note that the test statistic is given by f and for crops the f value is 0.857 with the corresponding 
/rvalue of 0.512. The f value for fertilizer is 6.237 with the corresponding /rvalue of 0.034. These values 
correspond with the values in Table 16.19 as well as the MINITAB output in Problem 16.11. 



TWO-FACTOR EXPERIMENTS WITH REPLICATION 



16.13 A manufacturer wishes to determine the effectiveness of four types of machines (A, B, C, and D) 
in the production of bolts. To accomplish this, the numbers of defective bolts produced by each 
machine in the days of a given week are obtained for each of two shifts; the results are shown in 
Table 16.20. Perform an analysis of variance to determine at the 0.05 significance level whether 
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this is a difference (a) between the machines and (b) between the shifts, (c) Use MINITAB to 
perform the analysis of variance and test for differences between machines and differences 
between shifts using the /rvalue approach. 



SOLUTION 

The data can be equivalent!) organized as in Table 16.21. In this table the two main factors are 
indicated: the machine and the shift. Note that two shifts have been indicated for each machine. The 
days of the week can be considered to be replicates (or repetitions) of performance for each machine for 
the two shifts. The total variation for all the data of Table 16.21 is 



K = 6 2 + 4 2 + 5 2 + --- + 7 2 + 10 2 - 



,268) 



= 1946- 1795.6 = 150.4 



Factor I: 
Machine 


Factor II: 
Shift 


Replicates 




Mon. Tue. Wed. Thu. Fri. 


Total 


A 


{2 


6 4 5 5 4 
5 7 468 


24 
30 


B 


{. 


10 8 7 7 9 
7 9 12 8 8 


41 
44 


C 


{I 


7 5 6 5 9 
9 7 5 4 6 


32 
31 


D 


{I 


8 4 6 5 5 
5 7 9 7 10 


28 
38 




Total 


57 51 54 47 59 


268 



In order to consider the two main factors (the machine and the shift), we limit our attention to the total 
of replication values corresponding to each combination of factors. These are arranged in Table 16.22, which 
is thus a two-factor table with single entries. The total variation for Table 16.22, which we shall call the 
subtotal variation V s . is given by 



_ (24) 2 (41) 2 (32) 2 (28) 2 (30) 2 (44) 2 (31) 2 (38) 2 _ (268) 2 



= 1861.2- 1795.6 = 65.6 
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The variation between rows is given by 



The variation between columns is given by 

v _(125) 2 (143) 2 (268) 2 _ 



1803.7- 1795.6 = 8.1 



- V c = 65.6 -51.0 -8.1 = 6.5 



If we now subtract from the subtotal variation V s the sum of the variations between the rows and columns 
(V R + V c ), we obtain the variation due to the interaction between the rows and columns. This is given by 

V, = V s - 

Finally, the residual variation, which we can think of as the random or error variation V E (provided that we 
believe that the various days of the week do not provide any important differences), is found by subtracting 
the subtotal variation (i.e., the sum of the row, column, and interaction variations) from the total variation 
V. This yields 

V E = V - (V R + V c + Vj) = V - V s = 150.4 - 65.6 = 84.8 

These variations are shown in Table 16.23, the analysis of variance. The table also gives the number of 
degrees of freedom corresponding to each type of variation. Thus, since there are four rows in Table 16.22, 



Table 16.23 



Variation 


Degrees of 
Freedom 


Mean Square 


F 


Rows (machines), 
^ = 51.0 


3 


S 2 R = 17.0 




Columns (shifts), 
K c = 8.1 


1 


5c = 8.1 




Interaction, 
V, = 6.5 


3 


S] = 2.167 




Subtotal, 
V s = 65.6 


7 






Random or residual, 
V E = 84.8 


32 


S\ = 2.65 


Total, 
V = 150.4 


39 
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the variation due to rows has 4-1 = 3 degrees of freedom, while the variation due to the two columns has 
2—1 = 1 degree of freedom. To find the degrees of freedom due to interaction, we note that there are eight 
entries in Table 16.22; thus the total degrees of freedom are 8—1 = 7. Since 3 of these 7 degrees of freedom 
are due to rows and 1 is due to columns, the remainder [7 - (3 + 1) = 3] are due to interaction. Since there 
are 40 entries in the original Table 16.21, the total degrees of freedom are 40 - 1 = 39. Thus the degrees of 
freedom due to random or residual variation are 39 - 7 = 32. 

First, we must determine whether there is any significant interaction. The interpolated critical value for 
the F distribution with 3 and 32 degrees of freedom is 2.90. The computed lvalue for interaction is 0.817 
and is not significant. There is a significant difference between machines, since the computed F value for 
machines is 6.42 and the critical value is 2.90. The critical value for shifts is 4.15. The computed value of F 
for shifts is 3.06. There is no difference in defects due to shifts. 

The data structure for MINITAB is shown below. Compare the data structure with the data given in 
Table 16.21 to see how the two sets of data compare. 



Machii 



Shift Defects 



40 



10 



The command MTB > Twoway 'Defects' 'Machine' 'Shifts' is used to produce the two-way 
analysis of variance. The p-value for interaction is 0.494. This is the minimum level of significance for which 
the null hypothesis would be rejected. Clearly, there is not a significant amount of interaction between shifts 
and machines. The p-value for shifts is 0.090. Since this exceeds 0.050, the mean number of defects for the 
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two shifts is not significantly different. The ;?-value for machines is 0.002. The mean numbers of defects for 



the four machines a 


re significantly different a 


the 0.050 level of significance. 


MTB > Twoway 'Defects' 'Machine' ' Sh 


Lft' 




Two-way Analysis of Var 


ance 






Analysis of Var 




for Defects 






Source 


DF 


SS 


MS 


F P 


Machine 


3 


51.00 17 


00 


6.42 0.002 


Shift 


1 


8. 10 8 


10 


3.06 0.090 


Interaction 


3 


6.50 2 


17 


0.82 0.494 


Error 


32 


84.80 2 


65 




Total 


39 


150.40 







Figure 16-7 gives the interaction plot for shifts and machines. The plot indicates possible interaction 
between shifts and machines. However, the p-value for interaction in the analysis of variance table tells us 
that there is not a significant amount of interaction. When interaction is not present, the broken line graphs 
for shift 1 and shift 2 are parallel. The main effects plot shown in Fig. 16-8 indicates that machine 1 produced 
the fewest defects on the average in this experiment and that machine 2 produced the most. There were more 
defects produced on shift 2 than shift 1. However, the analysis of variance tells us that this difference is not 
statistically significant. 



Shift 




12 3 4 

Machine 



Fig. 16-7 Interaction plot — data means for defects. 



Machine Shift 




Fig. 16-8 Main effects plot — data means for defects. 



430 



ANALYSIS OF VARIANCE 



[CHAP. 16 



SOLUTION 



A 


B 


C 


D 


E 






machinel 


machine2 


machine3 


machine4 




shiftl 












shiftl 


4 


8 


5 


4 




shiftl 


5 


7 


6 


6 




shiftl 


5 


7 


5 


5 




shiftl 


4 


9 


9 


5 




shift2 


5 


7 


9 _, 






shift2 


7 


9 








shift2 


4 


12 


5 


g 




shift2 


6 


8 


4 


7 




shift2 


8 


8 


6 


10 




Anova: Two-Factor With Replication 








SUMMARY 


machinel 


machine2 


machine3 


machine4 


Total 


shiftl 












Count 


5 


5 


5 


5 


20 


Sum 


24 


41 


32 


28 


125 


Average 


4.8 


8.2 


6.4 


5.6 


6.25 


Variance 


0.7 


1.7 


2.8 


2.3 


3.25 


shift2 










Count 


5 


5 


5 


5 


20 


Sum 


30 


44 


31 


38 


143 


Average 


6 


8.8 


6.2 


7.6 


7.15 


Variance 


2.5 


3.7 


3.7 


3.8 


4.239474 


Total 












Count 


10 


10 


10 


10 




Sum 


54 


85 


63 


66 




Average 


5.4 


8.5 


6.3 


6.6 




Variance 


1 .822222 


2.5 


2.9 


3.822222 




ANOVA 












Source of 












Variaton 


SS 


df 


MS 


F 


P-value 


Sample 


8.1 


1 


8.1 


3.056604 


0.089999 


Columns 


51 


3 


17 


6.415094 


0.001584 


Interaction 


6.5 


3 


2.166667 


0.81761 


0.49371 


Within 


84.8 


32 


2.65 






Total 


150.4 


39 









Set the data up in the EXCEL worksheet as shown. The pull-down Tools — > Data analysis — > Anova: 
Two-Factor with Replication gives the dialog box in Fig. 16-9 which is filled out as shown. 
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a: Two-Factor With Replicatio 



Input 

Input Range: 
Rows per sample: 
Alpha: 

Jut put options 
Output Range: 
ONew Worksheet Ply: 
O New Workbook 




Fig. 16-9 EXCEL dialog box for Problem 16.14. 

In the ANOVA, identify sample with shift and machine with column. Compare this output with the 
MINITAB output of Problem 16.13. 



LATIN SQUARES 



16.15 A farmer wishes to test the effects of four different fertilizers (A, B, C, and D) on the yield of 
wheat. In order to eliminate sources of error due to variability in soil fertility, he uses the 
fertilizers in a Latin-square arrangement, as shown in Table 16.24, where the numbers indicate 
yields in bushels per unit area. Perform an analysis of variance to determine whether there is a 
difference between the fertilizers at significance levels of (a) 0.05 and (b) 0.01. (c) Give the 
MINITAB solution to this Latin-square design, (d) Give the STATISTIX solution to this 
Latin-square design. 

SOLUTION 

We first obtain totals for the rows and columns, as shown in Table 16.25. We also obtain total yields for 
each of the fertilizers, as shown in Table 16.26. The total variation and the variations for the rows, columns, 
and treatments are then obtained as usual. We find: 

The total variation is 

V = (18) 2 + (21) 2 + (25) 5 + • • • + (10) 2 + (17) - ^ 
= 5769 - 5439.06 = 329.94 



C 19 
D 24 



A 15 
C 23 
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The variation between r< 



V R = + - 

= 5468.25 - 5439.06 = 29.19 



The variation between columns is 



V c 



_ (77) 2 (74) 2 (73) 2 (71) 2 (295) 2 



= 5443.75 - 5439.06 = 4.69 
The variation between treatments is 

_ (70) 2 , (48) 2 , (85) 2 , (92) 2 (295) : 



v B = - 



Table 16.27 shows the analysis of 



16 



= 5723.25- 5439.06 = 284.19 



Variation 


Degrees of 
Freedom 


Mean Square 


F 


Rows, 29.19 


3 


9.73 


4.92 


Columns, 4.69 


3 


1.563 


0.79 


Treatments, 284.19 


3 


94.73 


47.9 


Residuals, 11.87 


6 


1.978 




Total, 329.94 


15 







Since F95 3 6 = 4.76, we can reject at the 0.05 level the hypothesis that there are equal row means. It 
follows that at the 0.05 level there is a difference in the fertility of the soil from one row to another. 

Since the lvalue for columns is less than 1, we conclude that there is no difference in soil fertility in 
the columns. 

Since the F value for treatments is 47.9 > 4.76, we can conclude that there is a difference between 
fertilizers. 

Since ^.99,3,6 = 9.78, we can accept the hypothesis that there is no difference in soil fertility in the rows 
(or the columns) at the 0.01 level. However, we must still conclude that there is a difference between 
fertilizers at the 0.01 level. 
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(c) The structure of the data file for the MINITAB worksheet is given first. 
Row Rows Columns Treatment Yield 



Note that rows and columns in the farm layout are numbered 1 through 4. Fertilizers A through D ii 
Table 16.24 are coded as 1 through 4, respectively, in the worksheet. The MINITAB pull-down Stat — 
ANOVA — ► General Linear Model gives the following output. 



General Linear Model: Yield versus Rows, Columns, Treatment 





Type Le 


vels Values 




Rows 




4 1, 2, 3, 4 




Columns 


fixed 


4 1, 2, 3, 4 




Treatment 


fixed 


4 1, 2, 3, 4 




Analysis o 


f Variance for Yi 


Id, using Adjusted SS for Test 






DF Seq SS 


Adj SS Adj MS F 


P 




3 29.188 


29.188 9.729 4.92 


0.047 


Columns 


3 4.688 


4.687 1.562 0.79 


0.542 


Treatment 


3 284.188 2 


84.188 94.729 47.86 


0.000 


Error 


6 11.875 


11.875 1.979 




Total 


15 329.938 






S=l. 40683 


R-Sq=96.40% 


R-sq(adj)-91.00% 





The MINITAB output is the same as that computed by hand above. It indicates a difference in 
fertility from row to row at the 0.05 level but not at the 0.01 level. There is no difference in fertility from 
column to column. There is a difference in the four fertilizers at the 0.01 level. 

(d) The STATISTIX pull-down Statistics -> Linear Models -> Analysis of Variance Latin Square Design 
gives the following output. 



Latin Square AOV Table for Yield 

Source DF SS MS 

Rows 3 29.188 9.7292 

Columns 3 4.688 1.5625 

Treatment 3 284.188 94.7292 

Error 6 11.875 1.9792 

Total 15 329.938 
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GRAECO-LATIN SQUARES 

16.16 It is of interest to determine whether there is any significant difference in mileage per gallon 
between gasolines A, B, C, and D. Design an experiment that uses four different drivers, 
four different cars, and four different roads. 

SOLUTION 

Since the same number of gasolines, drivers, cars, and roads is involved (four), we can use a Graeco- 
Latin square. Suppose that the different cars are represented by the rows and that the different drivers are 
represented by the columns, as shown in Table 16.28. We now assign the different gasolines {A, B, C, and D) 
to the rows and columns at random, subject only to the requirement that each letter appears just once 
in each row and just once in each column. Thus each driver will have an opportunity to drive each car and to 
use each type of gasoline, and no car will be driven twice with the same gasoline. 

We now assign at random the four roads to be used, denoted by a, B, 7, and 8, subjecting them to the 
same requirement imposed on the Latin squares. Thus each driver will also have the opportunity to drive 
along each of the roads. Table 16.28 shows one possible arrangement. 



Table 16.28 





Driver 


12 3 4 


Car 1 


B 1 


A a 


D 6 


c a 


Car 2 


A 6 


B a 


c 7 


D, 3 


Car 3 




Cs 


Bp 


A,., 


Car 4 


Cg 


D 1 


A a 


B s 



16.17 Suppose that in carrying out the experiment of Problem 16.16 the numbers of miles per gallon are 
as given in Table 16.29. Use analysis of variance to determine whether there are any differences at 
the 0.05 significance level. Use MINITAB to obtain the analysis of variance and use the ^-values 
provided by MINITAB to test for any differences at the 0.05 significance level. 



Table 16.29 





Driver 


12 3 4 


Car 1 


B 1 19 


A a 16 


D 6 16 


C a 14 


Car 2 


A t 15 


B a 18 


C, 11 


Dp 15 


Car 3 


D a 14 


Q 11 


B s 21 


A 1 16 


Car 4 


Cp 16 


D 1 16 


A a 15 


B 6 23 
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We first obtain the row and column totals, as shown in Table 16.30. We then obtain the totals for each 
Latin letter and for each Greek letter, as follows: 





ICi U i 1^1 \c 

1 j + 10 + 1 j -p 10 — 




B total: 


19+18 + 21+23 = 


81 


C total: 


16+ 11 + 11 + 14 = 


52 


D total: 


14+16+16+15 = 


61 


a total: 


14+18+15+14 = 


61 


P total: 


16+16 + 21 + 15 = 


68 


7 total: 


19+16+11 + 16 = 


62 


6 total: 


15+ 11 + 16 + 23 = 


65 



We now compute the variations corresponding to all of these, using the shortcut method: 

- = 4112.50-4096 = 16.50 



(65) 2 (59) 2 (62) 2 (70) 2 (256) 2 
4 4 4 4 16 

(64) 2 (61) 2 (63) 2 (68) 2 (256) 2 



4102.50 -4096 = 6.50 



Gasohnes( W .Z)): ^ + ^ + ^ + ^-^ = 4207.50 -4096 = 111.50 
4 4 4 4 16 

Roads (a,/?, 7, 5): ^ + ^ + ^ + ^-^ = 4103.50 -4096 = 7.50 

v ' ; 4 4 4 4 16 

The total variation is 

(19) 2 + (16) 2 + (16) 2 + • • • + (15) 2 + (23) 2 - = 4244 - 4096 = 148.00 

so that the variation due to error is 

148.00 - 16.50- 6.50- 111.50- 7.50 = 6.00 

The results are shown in Table 16.31, the analysis of variance. The total number of degrees of freedom is 
N 2 - 1 for an N x N square. Each of the rows, columns, Latin letters, and Greek letters has N - 1 degrees of 
freedom. Thus the degrees of freedom for error are N 2 - 1 - 4(7V - 1) = (N - l)(N - 3). In our 



Table 16.30 







Total 




B 1 19 


A s 16 


D b 16 


C a 14 


65 


A s 15 


B a 18 


a, 11 


D a 15 


59 


D a 14 


Q 11 


Bp 21 


A 1 16 


62 


C s 16 


D 1 16 


A a 15 


B s 23 


70 


| Total 


64 


61 


63 


68 


256 
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Variation 


Degrees of 
Freedom 


Mean Square 


F 


Rows (cars), 
16.50 






5^ 
2.000 


Columns (drivers), 
6.50 


3 


2.167 


^Z-108 
2.000 


Gasolines (A,B,C,D), 
111.50 




37.167 


37.167 lg6 
2.000 ~~ 


Roads (a,/3,7,<S), 
7.50 


3 


2.500 


2 ^°°=1.25 
2.000 


6.00 


3 


2.000 




Total, 
148.00 


15 







We have F 95 Jy3 = 9.28 and F 99 , 3j3 = 29.5. Thus 
ne at the 0.05 level but not at the 0.01 level. 

of the data file for the MINITAB worksheet 



reject the hypothesis that the gasolines are the 



Note that cars and drivers are numbered the same in the MINITAB worksheet as in Table 16.29. 
Gasoline brands A through D in Table 16.29 are coded as 1 through 4, respectively, in the worksheet. 
Roads a, (3, 7, and S, in Table 16.29 are coded as 1,2, 3, and 4 in the worksheet. The MINITAB pull- 
down Stat — > ANOVA — ► General Linear Model gives the following output. 



General Linear Model: Mpg versus Car, Driver, Gasoline, Road 
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Analysis of Varia 


nee for Mpg, 


USl 


ng Adj 


isted SS f O] 


Tests 




Source DF 


Seq SS 


Ad 


j SS 


Adj MS 


F 


P 


Car 3 


16.500 


16 


500 


5.500 


2 . 75 


0. 214 


Driver 3 


6.500 


6 


500 


2.167 


1.08 


0.475 


Gasoline 3 


111.500 


111 


500 


37. 167 


18.58 


0.019 


Road 3 


7.500 


7 


500 


2.500 


1.25 


0.429 


Error 3 


6.000 


6 


000 


2.000 






Total 15 


148.000 













The column labeled seq ms in the MINITAB output is the same as the Mean square column in 
Table 16.31. The computed lvalues in the MINITAB output are the same as those given in Table 16.31. 
The ^-values for cars, drivers, gasoline brands, and roads are 0.214, 0.475, 0.019, and 0.429, respectively. 
Recall that a />-value is the minimum value for a pre-set level of significance at which the hypothesis of 
equal means for a given factor can be rejected. The ^-values indicate no differences for cars, drivers, or 
roads at the 0.01 or 0.05 levels. The means for gasoline brands are statistically different at the 0.05 level 
but not at the 0.01 level. Further investigation of the means for the brands would indicate how the 
means differ. 



MISCELLANEOUS PROBLEMS 

16.18 Prove [as in equation (75) of this chapter] that J2j a i — 0- 
SOLUTION 

The treatment population means ^, and the total population mean \x are related by 

M=*J>, (53) 
Then, since a, = fij - fi, we have, using equation (53), 

a J = -f J ') = Yl »i -afx = (54) 

16.19 Derive (a) equation (16) and (b) equation (77) of this chapter. 
SOLUTION 

(a) By definition, we have 

V W = Y.(x ik - x,) 1 = b£\\t V* ~ x jf] = b t S f 

j,k 7=1 V k=l J j=\ 

where Sf is the sample variance for the /th treatment. Then, since the sample size is b, 
E(V W ) = b^E(Sj) = ^Ef^ir = a{b - ^ 

7=1 7=1 ^ ' 

(b) By definition 

v B = b (Xj. - x ? = h x l ~ 2hx x .i- + ahxl = h Y x l~ abx2 

since X = (Y,jXj.)/a. Then, omitting the summation index, we have 

E( V B )=bY. E ( X f) ~ abE{X 2 ) (55) 
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Now for any random variable U, E(U 2 ) = var (U) + [E(U)f , where var (U) denotes the variance of U. 
Thus 

E(X 2 ) = var (X/) + [E(Xj)f (56) 
E(X 2 ) = V ar(X) + [E(X)] 2 (57) 
But since the treatment populations are normal with mean /i, = /i + a h we have 



«(*,)=y 


(58) 




(59) 


E{Xj.)=fij = n + aj 


(60) 


E(X) = fi 


(61) 



Using results (56) to (61) together with result (53), we have 

W)=*E[y+C^ + ay) a ]-^ + M l ] 
= aa 2 + A E (/x + aj f - a 2 - abf? 
= (a - \)a 2 + abf/ + 2b\i ^ a } + b^a 2 + abf/ 
= (a~l)a 2 + bJ2* 2 

16.20 Prove Theorem 1 in this chapter. 
SOLUTION 

As shown in Problem 16.19, 

where S 2 is the sample variance for samples of size b drawn from the population of treatment / We see that 
bS 2 1 a 2 has a chi-square distribution with b - 1 degrees of freedom. Thus, since the variances Sf are 
independent, we conclude from page 264 that V w /a 2 is chi-square-distributed with a(b — 1) degrees of 
freedom. 
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Supplementary Problems 



ONE-WAY CLASSIFICATION, OR ONE-FACTOR EXPERIMENTS 

In all of the supplementary problems, the reader is encouraged to work the problems "by hand" 
using the equations given in the chapter before using the suggested software to work the problem. This 
will increase your understanding of the ANOVA technique as well as help you appreciate the power 
of the software. 



16.21 An experiment is performed to determine the yields of five different varieties of wheat: A, B, C, D, and E. 
Four plots of land are assigned to each variety, and the yields (in bushels per acre) are as shown in Table 16.32. 
Assuming that the plots are of similar fertility and that the varieties are assigned to the plots at random, 
determine whether there is a difference between the yields at significance levels of (a) 0.05 and (b) 0.01. (c) Give 
the MINITAB analysis for this one-way classification or one-factor experiment. 

Table 16.32 



16.22 A company wishes to test four different types of tires: A, B, C, and D. The tires' lifetimes, as determined from 
their treads, are given (in thousands of miles) in Table 16.33, where each type has been tried on six similar 
automobiles assigned at random to the tires. Determine whether there is a significant difference between the 
tires at the (a) 0.05 and (b) 0.01 levels, (c) Give the STATISTIX analysis for this one-way classification or 
one-factor experiment. 



A 33 38 36 40 31 35 
B 32 40 42 38 30 34 
C 31 37 35 33 34 30 



16.23 A teacher wishes to test three different teaching methods: I, II, and III. To do this, three groups of five 
students each are chosen at random, and each group is taught by a different method. The same examination 
is then given to all the students, and the grades in Table 16.34 are obtained. Determine whether there is a 
difference between the teaching methods at significance levels of (a) 0.05 and (b) 0.01. (c) Give the EXCEL 
analysis for this one-way classification or one-factor experiment. 



75 62 71 58 73 



73 79 60 75 81 
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MODIFICATIONS FOR UNEQUAL NUMBERS OF OBSERVATIONS 



16.24 Table 16.35 gives the numbers of miles to the gallon obtained by similar automobiles using five different 
brands of gasoline. Determine whether there is a difference between the brands at significance levels of 
(a) 0.05 and (b) 0.01. (c) Give the SPSS analysis for this one-way classification or one-factor experiment. 



Table 16.35 Table 16.36 

Brandt 12 15 14 11 15 Mathematics 72 80 83 75 

Brandt 14 12 15 Science 81 74 77 

Brand C 11 12 10 14 English 88 82 90 87 80 

Brand/) 15 18 16 17 14 Economics 74 71 77 70 



16.25 During one semester a student received grades in various subjects, as shown in Table 16.36. Determine 
whether there is a significant difference between the student's grades at the (a) 0.05 and (b) 0.01 levels, 
(c) Give the SAS analysis for this one-way classification or one-factor experiment. 



TWO-WAY CLASSIFICATION, OR TWO-FACTOR EXPERIMENTS 



16.26 Articles manufactured by a company are produced by three operators using three different machines. The 
manufacturer wishes to determine whether there is a difference (a) between the operators and (b) between the 
machines. An experiment is performed to determine the number of articles per day produced by each 
operator using each machine; the results are shown in Table 16.37. Provide the desired information using 
MINITAB and a significance level of 0.05. 



Table 16.37 





Operator 


1 2 3 


Machine A 
Machine B 
Machine C 


23 27 24 
34 30 28 
28 25 27 



Table 16.38 





Type of Corn 


I 


II 


III 


IV 


Block A 


12 


15 


10 


14 


Block B 


15 


19 


12 




Block C 


14 


18 


15 


12 


Block D 


11 


16 


12 


16 


Block E 


16 


17 


11 


14 



16.27 Use EXCEL to work Problem 16.26 at the 0.01 significance level. 

16.28 Seeds of four different types of corn are planted in five blocks. Each block is divided into four plots, which 
are then randomly assigned to the four types. Determine at the 0.05 significance level whether the yields in 
bushels per acre, as shown in Table 16.38, vary significantly with differences in {a) the soil (i.e., the five 
blocks) and (b) the type of corn. Use SPSS to build your ANOVA. 

16.29 Work Problem 16.28 at the 0.01 significance level. Use STATISTIX to build your ANOVA. 

16.30 Suppose that in Problem 16.22 the first observation for each type of tire is made using one particular kind of 
automobile, the second observation is made using a second particular kind, and so on. Determine at the 0.05 
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significance level whether there is a difference (a) between the types of tires and (b) between the kinds of 
automobiles. Use SAS to build your ANOVA. 



16.31 Use MINITAB to work Problem 16.30 at the 0.01 level of significance. 



16.32 Suppose that in Problem 16.23 the first entry for each teaching method corresponds to a student at one 
particular school, the second method corresponds to a student at another school, and so on. Test the 
hypothesis at the 0.05 significance level that there is a difference (a) between the teaching methods and 
(b) between the schools. Use STATISTIX to build your ANOVA. 



16.33 An experiment is performed to test whether the hair color and heights of adult female students in the United 
States have any bearing on scholastic achievement. The results are given in Table 16.39, where the numbers 
indicate individuals in the top 10% of those graduating. Analyze the experiment at a significance level of 
0.05. Use EXCEL to build your ANOVA. 

Table 16.39 Table 16.40 



Medium 
Short 



Redhead 



Blonde 



Brunette 



16 18 20 



18 22 21 



17 18 24 20 



16.34 Work Problem 16.33 at the 0.01 significance level. Use SPSS to build your ANOVA and compare your 
output with the EXCEL output of Problem 16.33. 



TWO-FACTOR EXPERIMENTS WITH REPLICATION 



16.35 Suppose that the experiment of Problem 16.21 was carried out in the southern part of the United States and 
that the columns of Table 16.32 now indicate four different types of fertilizer, while a similar experiment 
performed in the western part gives the results shown in Table 16.40. Determine at the 0.05 significance level 
whether there is a difference in yields due to (a) the fertilizers and (b) the locations. Use MINITAB to build 
your ANOVA. 

16.36 Work Problem 16.35 at the 0.01 significance level. Use STATISTIX to build your ANOVA and compare 
your output with the MINITAB output of Problem 16.35. 

16.37 Table 16.41 gives the number of articles produced by four different operators working on two different types 
of machines, I and II, on different days of the week. Determine at the 0.05 level whether there are significant 



Table 16.41 





Machine I 






Machine 


II 








Wed. 


Thu. 


Fri. 


Mon. 




Wed. 


Thu. 


Fri. 


Operato 


r A 


15 


18 


17 


20 


12 


14 


16 


18 


17 


15 


Operato 


r B 


12 


16 


14 


18 


11 


11 


15 


12 


16 


12 


Operato 


r C 


14 


17 


18 


16 


13 


12 


14 


16 


14 


11 


Operato 


rZ> 


19 


16 


21 


23 


18 


17 


15 


18 


20 


17 
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differences (a) between the operators and (b) between the machines. Use SAS and MINITAB to build your 
ANOVA. 



LATIN SQUARES 



16.38 An experiment is performed to test the effect on corn yield of four different fertilizer treatments (A, B, C, 
and D) and of soil variations in two perpendicular directions. The Latin square of Table 16.42 is obtained, 
where the numbers show the corn yield per unit area. Test at the 0.01 significance level the hypothesis that 
there is no difference between (a) the fertilizers and (b) the soil variations. Use STATISTIX to build your 
ANOVA. 



Table 16.42 



A 10 
C 12 



E 75 
M 81 



Table 16.43 

W 78 
E 76 



M 80 
W 79 



16.39 Work Problem 16.38 at the 0.05 significance level. Use MINITAB to build your ANOVA and compare your 
output with the STATISTIX output of Problem 16.38. 

16.40 Referring to Problem 16.33, suppose that we introduce an additional factor — giving the section E, M, or W 
of the United States in which a student was born, as shown in Table 16.43. Determine at the 0.05 level 
whether there is a significant difference in the scholastic achievements of female students due to differences in 
(a) height, (b) hair color, and (c) birthplace. Use SPSS to build your ANOVA. 



GRAECO-LATIN SQUARES 



16.41 In order to produce a superior type of chicken feed, four different quantities of each of two chemicals are 
added to the basic ingredients. The different quantities of the first chemical are indicated by A, B, C, and D, 
while those of the second chemical are indicated by a, fi, 7, and S. The feed is given to baby chicks arranged 
in groups according to four different initial weights ( W u W 2 , W 3 , and W 4 ) and four different species (Si, S 2 , 
S 3 , and &,). The increases in weight per unit time are given in the Graeco-Latin square of Table 16.44. 
Perform an analysis of variance of the experiment at the 0.05 significance level, stating any conclusions that 
can be drawn. Use MINITAB to build your ANOVA. 



Table 16.44 





w l 


W 2 


W 3 


W 4 




C, 8 


Bp 6 


A a 5 


D 6 6 


s 2 


A 6 4 


D a 3 


Q 7 


5,, 3 


S3 


Dp 5 


A, 6 


B 6 5 


C a 6 


s 4 


B a 6 


Q 10 


D 1 10 


Ap 8 



16.42 Four different types of cable (T u T 2 , T 3 , and T 4 ) are manufactured by each of four companies (d, C 2 , C 3 , 
and C 4 ). Four operators (A, B, C, and D) using four different machines (a, (3, 7, and S) measure the 
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cable strengths. The average strengths obtained are given in the Graeco-Latin square of Table 16.45. 
Perform an analysis of variance at the 0.05 significance level, stating any conclusions that can be drawn. 
Use SPSS to build your ANOVA. 



MISCELLANEOUS PROBLEMS 



16.43 Table 16.46 gives data on the accumulated rust on iron that has been treated with chemical A, B, or C, 
respectively. Determine at the (a) 0.05 and (b) 0.01 levels whether there is a significant difference in the 
treatments. Use EXCEL to build your ANOVA. 





C\ 


c 2 


c 3 


Q 


T\ 


A 164 


B 1 181 


C a 193 


D s 160 


T 2 


C s 171 


D a 162 


A 1 183 


B 145 


T 3 


D 1 198 


C fj 212 


B s 207 


A a 188 


T 4 


B a 157 


A, 172 


D 166 


C,, 136 



16.44 An experiment measures the intelligence quotients (IQs) of adult male students of tall, short, and medium 
stature. The results are given in Table 16.47. Determine at the (a) 0.05 and (b) 0.01 significance levels whether 
there is any difference in the IQ scores relative to the height differences. Use MINITAB to build your 
ANOVA. 



Short 
Medium 



16.45 Prove results (10), (11), and (12) of this chapter. 



16.46 An examination is given to determine whether veterans or nonveterans of different IQs performed better. 
The scores obtained are shown in Table 16.48. Determine at the 0.05 significance level whether there is a 
difference in scores due to differences in (a) veteran status and (b) IQ. Use SPSS to build your ANOVA. 



Table 16.48 





Test Score 




High 


Medium 






IQ 


IQ 


IQ 


Veteran 


90 


81 


74 


Nonveteran 


85 


78 


70 
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16.47 Use STATISTIX to work Problem 16.46 at the 0.01 significance level. 



16.48 Table 16.49 shows test scores for a sample of college students who are from different parts of the country and 
who have different IQs. Analyze the table at the 0.05 significance level and state your conclusions. 
Use MINITAB to build your ANOVA. 



16.49 Use SAS to work Problem 16.48 at the 0.01 significance level. 



16.50 In Problem 16.37, can you determine whether there is a significant difference in the number of articles 
produced on different days of the week? Explain. 



Table 16.49 





Test Score 




High 


Medium 






IQ 


IQ 


IQ 


East 


88 


80 


72 


West 


84 


78 


75 


South 


86 


82 


70 


North and central 


80 


75 


79 



16.51 In analysis-of-variance calculations it is known that a suitable constant can be added or subtracted from 
each entry without affecting the conclusions. Is this also true if each entry is multiplied or divided by a 
suitable constant? Justify your answer. 



16.52 Derive results (24), (25), and (26) for unequal numbers of observations. 



16.53 Suppose that the results in Table 16.46 of Problem 16.43 hold for the northeastern part of the United States, 
while the corresponding results for the western part are those given in Table 16.50. Determine at the 0.05 
significance level whether there are differences due to (a) chemicals and (b) location. Use MINITAB to build 
your ANOVA. 



20 10 20 15 



16.54 Referring to Problems 16.21 and 16.35, suppose that an additional experiment performed in the northeastern 
part of the United States produced the results given in Table 16.51. Determine at the 0.05 significance level 
whether there is a difference in yields due to (a) the fertilizers and (b) the three locations. Use STATISTIX to 
build your ANOVA. 
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16.55 Work Problem 16.54 at the 0.01 significance level. Use MINITAB to build your ANOVA. 



16.56 Perform an analysis of variance on the Latin square of Table 16.52 at the 0.05 significance level and state 
your conclusions. Use SPSS to build your ANOVA. 



Table 16.52 

Factor 1 



B 16 

A 18 



A 15 
C 14 



16.57 Make up an experiment leading to the Latin square of Table 16.52. 



16.58 Perform an analysis of variance on the Graeco-Latin square of Table 16.53 at the 0.05 significance level and 
state your conclusions. Use SPSS to build your ANOVA. 



Table 16.53 

Factor 1 



A 1 6 


B s 12 


Q 4 


D a 18 


B 6 3 


A a 8 


D 1 15 


Cp 14 


Da 15 


C, 20 


B a 9 


A s 5 


C a 16 


D 6 6 


A 17 


B 1 7 



16.59 Make up an experiment leading to the Graeco-Latin square of Table 16.53. 

16.60 Describe how to use analysis-of-variance techniques for three-factor experiments with replications. 

16.61 Make up and solve a problem that illustrates the procedure in Problem 16.60. 

16.62 Prove (a) equation (30) and (b) results (37) to (34) of this chapter. 

16.63 In practice, would you expect to find (a) a 2 x 2 Latin square and (b) a 3 x 3 Graeco-Latin square? Explain. 



Nonparametric Tests 



INTRODUCTION 

Most tests of hypotheses and significance (or decision rules) considered in previous chapters require 
various assumptions about the distribution of the population from which the samples are drawn. For 
example, the one-way classification of Chapter 16 requires that the populations be normally distributed 
and have equal standard deviations. 

Situations arise in practice in which such assumptions may not be justified or in which there is doubt 
that they apply, as in the case where a population may be highly skewed. Because of this, statisticians 
have devised various tests and methods that are independent of population distributions and associated 
parameters. These are called nonparametric tests. 

Nonparametric tests can be used as shortcut replacements for more complicated tests. They are 
especially valuable in dealing with nonnumerical data, such as arise when consumers rank cereals or 
other products in order of preference. 

THE SIGN TEST 

Consider Table 17.1, which shows the numbers of defective bolts produced by two different types of 
machines (I and II) on 12 consecutive days and which assumes that the machines have the same total 
output per day. We wish to test the hypothesis H that there is no difference between the machines: that 
the observed differences between the machines in terms of the numbers of defective bolts they produce 
are merely the result of chance, which is to say that the samples come from the same population. 



Table 17.1 



Day 




1 2 3 


4 


5 


6 7 


8 9 


10 11 12 


Machine I 


47 56 54 


49 


36 


48 51 


38 61 


49 56 52 


Machine 11 


71 63 45 


64 


50 


55 42 


46 53 


57 75 60 



A simple nonparametric test in the case of such paired samples is provided by the sign test. This test 
consists of taking the difference between the numbers of defective bolts for each day and writing only the 
sign of the difference; for instance, for day 1 we have 47-71, which is negative. In this way we obtain 
from Table 17.1 the sequence of signs 

- + -■- + - + --- (1) 
446 
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(i.e., 3 pluses and 9 minuses). Now if it is just as likely to get a + as a -, we would expect to get 6 of each. 
The test of H is thus equivalent to that of whether a coin is fair if 12 tosses result in 3 heads (+) and 9 
tails (-). This involves the binomial distribution of Chapter 7. Problem 17.1 shows that by using a 
two-tailed test of this distribution at the 0.05 significance level, we cannot reject H ; that is, there 
is no difference between the machines at this level. 

Remark 1: If on some day the machines produced the same number of defective bolts, 
a difference of zero would appear in sequence (7). In such case we can omit these sample 
values and use 11 instead of 12 observations. 

Remark 2: A normal approximation to the binomial distribution, using a correction 
for continuity, can also be used (see Problem 17.2). 

Although the sign test is particularly useful for paired samples, as in Table 17.1, it can also be used 
for problems involving single samples (see Problems 17.3 and 17.4). 



THE MANN-WHITNEY U TEST 

Consider Table 17.2, which shows the strengths of cables made from two different alloys, I and II. In 
this table we have two samples: 8 cables of alloy I and 10 cables of alloy II. We would like to decide 
whether or not there is a difference between the samples or, equivalently, whether or not they come from 
the same population. Although this problem can be worked by using the t test of Chapter 11, a non- 
parametric test called the Mann-Whitney U test, or briefly the U test, is useful. This test consists of the 
following steps: 



Table 17.2 



Alloy I 


Alloy II 


18.3 16.4 22.7 17.8 
18.9 25.3 16.1 24.2 


12.6 14.1 20.5 10.7 15.9 
19.6 12.9 15.2 11.8 14.7 



Step 1. Combine all sample values in an array from the smallest to the largest, and assign ranks 
(in this case from 1 to 18) to all these values. If two or more sample values are identical (i.e., there are 
tie scores, or briefly ties), the sample values are each assigned a rank equal to the mean of the ranks 
that would otherwise be assigned. If the entry 18.9 in Table 17.2 were 18.3, two identical values 
18.3 would occupy ranks 12 and 13 in the array so that the rank assigned to each would be 
i(12 + 13) = 12.5. 

Step 2. Find the sum of the ranks for each of the samples. Denote these sums by R\ , and R 2 , where 
N { and N 2 are the respective sample sizes. For convenience, choose N { as the smaller size if they are 
unequal, so that N x < N 2 . A significant difference between the rank sums 7^ and R 2 implies a significant 
difference between the samples. 

Step 3. To test the difference between the rank sums, use the statistic 

U = N l N 2 + N ^± l l-R i (2) 

corresponding to sample 1 . The sampling distribution of U is symmetrical and has a mean and variance 
given, respectively, by the formulas 



NiN 2 n N l N 2 (N i +N 2 +l) 
^ u ~ 2 ° i > = 12 



(3) 
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If TV! and N 2 are both at least equal to 8, it turns out that the distribution of U is nearly normal, 
so that 



is normally distributed with mean and variance 1. Using Appendix II, we can then decide whether the 
samples are significantly different. Problem 17.5 shows that there is a significant difference between the 
cables at the 0.05 level. 

Remark 3: A value corresponding to sample 2 is given by the statistic 

U = N X N 2 + N * N 1 + V-R 2 (5) 

and has the same sampling distribution as statistic (2), with the mean and variance of 
formulas (3). Statistic (5) is related to statistic (2), for if U\ and U 2 are the values 
corresponding to statistics (2) and (5), respectively, then we have the result 

U y + U 2 = N X N 2 (6) 

We also have 

*,+« : = ^ (7) 

where N = N { + N 2 . Result (7) can provide a check for calculations. 

Remark 4: The statistic U in equation (2) is the total number of times that sample 1 
values precede sample 2 values when all sample values are arranged in increasing order 
of magnitude. This provides an alternative counting method for finding U. 



THE KRUSKAL-WALLIS H TEST 

The U test is a nonparametric test for deciding whether or not two samples come from the same 
population. A generalization of this for k samples is provided by the Kruskal-Wallis H test, or briefly the 
H test. 

This test may be described thus: Suppose that we have k samples of sizes N\, N 2 , . . . , N k , with the 

total size of all samples taken together being given by N = N { + N 2 H hiV t . Suppose further that the 

data from all the samples taken together are ranked and that the sums of the ranks for the k samples are 
Ri, R 2 ,...,R k , respectively. If we define the statistic 

^T^tf-W+D ( S ) 

then it can be shown that the sampling distribution of His very nearly a chi-squarc distribution with k - 1 
degrees of freedom, provided that N u N 2 , . . . , N k are all at least 5. 

The H test provides a nonparametric method in the analysis of variance for one-way classification, or 
one-factor experiments, and generalizations can be made. 



THE H TEST CORRECTED FOR TIES 



In case there are too many ties among the observations in the sample data, the value of H given by 
statistic (8) is smaller than it should be. The corrected value of H, denoted by H c , is obtained by dividing 



CHAP. 17] 



NONPARAMETRIC TESTS 



44'-) 



the value given in statistic (8) by the correction factor 



where T is the number of ties corresponding to each observation and where the sum is taken over all the 
observations. If there are no ties, then T — and factor (9) reduces to 1, so that no correction is needed. 
In practice, the correction is usually negligible (i.e., it is not enough to warrant a change in the decision). 



THE RUNS TEST FOR RANDOMNESS 

Although the word "random" has been used many times in this book (such as in "random sam- 
pling" and "tossing a coin at random"), no previous chapter has given any test for randomness. 
A nonparametric test for randomness is provided by the theory of runs. 

To understand what a run is, consider a sequence made up of two symbols, a and b, such as 

a a\bbb\a\bb\a a a a a\b b b \ a a a a\ (10) 

In tossing a coin, for example, a could represent "heads" and b could represent "tails." Or in sampling 
the bolts produced by a machine, a could represent "defective" and b could represent "nondefective." 

A run is defined as a set of identical (or related) symbols contained between two different symbols or 
no symbol (such as at the beginning or end of the sequence). Proceeding from left to right in sequence 
(10), the first run, indicated by a vertical bar, consists of two a's; similarly, the second run consists of 
three b's, the third run consists of one a, etc. There are seven runs in all. 

It seems clear that some relationship exists between randomness and the number of runs. Thus for 
the sequence 

a\b\a\b\a\b\a\b\a\b\a\b\ (11) 

there is a cyclic pattern, in which we go from a to b, back to a again, etc., which we could hardly believe 
to be random. In such case we have too many runs (in fact, we have the maximum number possible for 
the given number of a's and b's). 

On the other hand, for the sequence 

a a a a a a \ b b b b \ a a a a a \ b b b \ (12) 

there seems to be a trend pattern, in which the a's and b's are grouped (or clustered) together. In such case 
there are too few runs, and we would not consider the sequence to be random. 

Thus a sequence would be considered nonrandom if there are either too many or too few runs, and 
random otherwise. To quantify this idea, suppose that we form all possible sequences consisting of N\ a's 
and N 2 b's, for a total of TV symbols in all (N { + N 2 = N). The collection of all these sequences provides 
us with a sampling distribution: Each sequence has an associated number of runs, denoted by V. In this 
way we are led to the sampling distribution of the statistic V. It can be shown that this sampling 
distribution has a mean and variance given, respectively, by the formulas 

2^N 2 , 1 2 2N { N 2 (2N { N 2 -Nj- N 2 ) 

Ul-'=— 7^+1 Oy — = (13) 

N { +N 2 ( Nl+Nl f( Nl+ N 2 -l) 1 ; 

By using formulas (73), we can test the hypothesis of randomness at appropriate levels of significance. It 
turns out that if both ATj and N 2 are at least equal to 8, then the sampling distribution of Vis very nearly 
a normal distribution. Thus 

Z = ^L {14) 



is normally distributed with mean and variance 1, and thus Appendix II can be used. 
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FURTHER APPLICATIONS OF THE RUNS TEST 

The following are other applications of the runs test to statistical problems: 

1 . Above- and Below-Median Test for Randomness of Numerical Data. To determine whether numer- 
ical data (such as collected in a sample) are random, first place the data in the same order in which 
they were collected. Then find the median of the data and replace each entry with the letter a or b 
according to whether its value is above or below the median. If a value is the same as the median, 
omit it from the sample. The sample is random or not according to whether the sequence of a's and 
b's is random or not. (See Problem 17.20.) 

2. Differences in Populations from Which Samples Are Drawn. Suppose that two samples of sizes m 
and n are denoted by a x , a 2 , . . . , a m and b x ,b 2 ,..., b„, respectively. To decide whether the samples 
do or do not come from the same population, first arrange all m + n sample values in a sequence of 
increasing values. If some values are the same, they should be ordered by a random process (such as 
by using random numbers). If the resulting sequence is random, we can conclude that the samples 
are not really different and thus come from the same population; if the sequence is not random, 
no such conclusion can be drawn. This test can provide an alternative to the Mann-Whitney U test. 
(See Problem 17.21.) 



SPEARMAN'S RANK CORRELATION 

Nonparametric methods can also be used to measure the correlation of two variables, X and Y. 
Instead of using precise values of the variables, or when such precision is unavailable, the data may be 
ranked from 1 to N in order of size, importance, etc. If X and Y are ranked in such a manner, the 
coefficient of rank correlation, or Spearman 's formula for rank correlation (as it is often called), is given by 

r s =l- 6 ^ 2 °\ (15) 
N(N 2 - 1) y ' 

where D denotes the differences between the ranks of corresponding values of X and Y, and where N is 
the number of pairs of values (X, Y) in the data. 



THE SIGN TEST 

17.1 Referring to Table 17.1, test the hypothesis H Q that there is no difference between machines I and 
II against the alternative hypothesis H x that there is a difference at the 0.05 significance level. 



Figure 17-1 shows the binomial distribution of the probabilities of X heads in 12 tosses of a coin as 
areas under rectangles at X=0, 1, ...,12. Superimposed on the binomial distribution is the normal 
distribution, shown as a dashed curve. The mean of the binomial distribution is /i = Np = 12(0.5) = 6. 
The standard deviation is a = yJNpq = ^/12(0.5)(0.5) = \f% = 1.73. The normal curve also has mean = 6 
and standard deviation = 1.73. From chapter 7 the binomial probability of X heads is 



Solved Problems 



SOLUTION 
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Fig. 17-1 Binomial distribution (ar 
(dashed curve) 



The probabilities may be found by using EXCEL. The />-value corresponding to the outcome X= 3 
is 2P{jT<3} which using EXCEL is =2*BIN0MDIST ( 3 , 12 , . 5 , 1) or 0.146. (The p-value is twice 
the area in the left tail of the binomial distribution.) Since this area exceeds 0.05, we are unable to reject 
the null at level of significance 0.05. We thus conclude that there is no difference in the two machines 
at this level. 

17.2 Work Problem 17.1 by using a normal approximation to the binomial distribution. 
SOLUTION 

For a normal approximation to the binomial distribution, we use the fact that the z score corresponding 
to the number of heads is 



X-ix 



X-Np 



Because the variable X for the binomial distribution is discrete while that for a normal distribution is 
continuous, we make a correct ion for continuity (for example, 3 heads are really a value between 2.5 and 
3.5 heads). This amounts to decreasing X by 0.5 if X > Np and to increasing X by 0.5 if X < Np. Now 
N = 12, n = Np = (12)(0.5) = 6, and a = -jWpq = v /(12)(0.5)(0.5) = 1.73, so that 



The p-value is twice the area to the left of -1.45. Using EXCEL, we evaluate =2*N0RMSDIST (-1.45) and 
get 0.147. Looking at Figure 17-1, the approximate p- value is the area under the standard normal curve from 
— 1 .45 to the left and it is doubled because the hypothesis is two-tailed. Notice how close the two p-values are 
to one another. The binomial area under the rectangles is 0.146 and the area under the approximating 
standard normal curve is 0.147. 



17.3 The PQR Company claims that the lifetime of a type of battery that it manufactures is more than 
250 hours (h). A consumer advocate wishing to determine whether the claim is justified measures 
the lifetimes of 24 of the company's batteries; the results are listed in Table 17.3. Assuming the 
sample to be random, determine whether the company's claim is justified at the 0.05 significance 
level. Work the problem first by hand, supplying all the details for the sign test. Follow this with 
the MINITAB solution to the problem. 
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SOLUTION 

Let H be the hypothesis that the company's batteries have a lifetime equal to 250 h, and let //j be 
the hypothesis that they have a lifetime greater than 250 h. To test H Q against H u we can use the sign test. 
To do this, we subtract 250 from each entry in Table 17.3 and record the signs of the differences, as shown in 
Table 17.4. We see that there are 15 plus signs and 9 minus signs. 



Table 17.3 Table 17.4 

271 230 198 275 282 225 284 219 +-- + + -+ - 

253 216 262 288 236 291 253 224 + - + + - + +- 

264 295 211 252 294 243 272 268 + + - + + -+ + 



Using the normal approximation to the binomial distribution, the z score is 
(15 -0.5) -24(0.5) = 1 Q2 
\/24(0.5)(0.5) 

Note the correction for continuity (15 - 0.5) = 14.5 when 0.5 is subtracted form 15. The p-value is the area 
to the right of 1.02 on the z-curve, also called the standard normal curve. (See Fig. 17-2.) 




The />-value is the area to the right of z = 1 .02 or, using EXCEL, the area is given by 
= 1-N0RMSDIST (1.02) or 0.1537. Since the p-value >0.05, the companies claim cannot be justified. 

The solution using MINITAB is as follows. The data in Table 17.3 are entered into column 1 of the 
M1NITAB worksheet and the column is named Lifetime. The pull-down Stat — > Non-parametrics — > 
1-sample sign gives the following output. 

Sign Test for Median: Lifetime 

Sign test of median = 250 . versus > 250.0 

N Below Equal Above P Median 

Lifetime 24 9 15 0.1537 257.5 

Note that the information given here is the same that we arrived at earlier in the solution of this 
problem. 
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17.4 A sample of 40 grades from a statewide examination is shown in Table 17.5. Test the hypothesis 
at the 0.05 significance level that the median grade for all participants is (a) 66 and (b) 75. 
Work the problem first by hand, supplying all the details for the sign test. Follow this with the 
MINITAB solution to the problem. 

Table 17.5 




SOLUTION 

(a) Subtracting 66 from all the entries of Table 17.5 and retaining only the associated signs gives us Table 
17.6, in which we see that there are 23 pluses, 15 minuses, and 2 zeros. Discarding the 2 zeros, our 
sample consists of 38 signs: 23 pluses and 15 minuses. Using a two-tailed test of the normal distribution 
with probabilities i(0.05) = 0.025 in each tail (Fig. 17-3), we adopt the following decision rule: 
Accept the hypothesis if -1.96 < z < 1.96. 
Reject the hypothesis otherwise. 



_ X - Np _ (23 - 0.5) - (38)(0.5) _ 
" VWq ~ V(38)(0.5)(0.5) ~" 



1.14 



e accept the hypothesis that the median is 66 at the 0.05 level. 

Table 17.6 
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(b) Subtracting 75 from all the entries in Table 17.5 gives us Table 17.7, in which there are 13 pluses and 
27 minuses. Since 



_ (13 + 0.5)-(40)(0.5) = _ 9 
V(40)(0.5)(0.5) 

we reject the hypothesis that the median is 75 at the 0.05 level. 

Table 17.7 



Using this method, we can arrive at a 95% confidence interval for the median grade on the 
examination. (See Problem 17.30.) 

Note that the above solution uses the classical method of testing hypothesis. The classical method uses 
a = 0.05 to find the rejection region for the test, [z < -1.96 or z > 1.96]. Next, the test statistic is computed 
[z = —1.14 in part (a) and z = -2.06 in part (b)]. If the test statistic falls in the rejection region, then reject 
the null. If the test statistic does not fall in the rejection region, do not reject the null. 

The MINITAB solution uses the p-value approach. Calculate the p-value and, if it is <0.05, reject the 
null hypothesis. If the p-value >0.05, do not reject the null hypothesis. The same decision will always be 
reached using either the classical method or the p-value method. 

The solution using MINITAB proceeds as follows. To test that the median equals 66, the output is 

Sign Test for Median: Grade 

Sign test of median = 66. 00 versus not = 66 . 00 

N Below Equal Above P Median 

Grade 40 15 2 23 0.2559 70.00 

Since the p-value >0.05, do not reject the null hypothesis. 
To test that the median equals 75, the output is 

Sign Test for Median: Grade 

Sign test of median = 75. 00 versus not = 75 . 00 

N Below Equal Above P Median 

Grade 40 27 13 0.0385 70.00 

Since the /)-value<0.05, reject the null hypothesis. 



THE MANN-WHITNEY U TEST 

17.5 Referring to Table 17.2, determine whether there is a difference at the 0.05 significance level 
between cables made of alloy I and alloy II. Work the problem first by hand, supplying all the 
details for the Mann-Witney U test. Follow this with the MINITAB solution to the problem. 

SOLUTION 

We organize the work in accordance with steps 1, 2, and 3 (described earlier in this chapter): 
Step 1. Combining all 18 sample values in an array from the smallest to the largest gives us the first 
line of Table 17.8. These values are numbered 1 to 18 in the second line, which gives us the ranks. 
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10.7 11.8 12.6 12.9 14.1 14.7 15.2 15.9 16.1 16.4 17.8 18.3 18.9 19.6 20.5 22.7 24.2 25.3 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 



Step 2. To find the sum of the ranks for each sample, rewrite Table 17.2 by using the associated ranks 
from Table 17.8; this gives us Table 17.9. The sum of the ranks is 106 for alloy I and 65 for alloy II. 



Alloy I 


Alloy II 


Cable 




Cable 




Strength 


Rank 


Strength 


Rank 


18.3 


12 


12.6 


3 


16.4 


10 


14.1 


5 


22.7 


16 


20.5 


15 


17.8 


11 


10.7 




18.9 


13 


15.9 


8 


25.3 


18 


19.6 


14 


16.1 


9 


12.9 


4 


24.2 


17 


15.2 


7 




Sum 106 


11.8 


2 






14.7 


6 








Sum 65 



Step 3. Since the alloy I sample has the smaller size, N x -- 
the ranks are R x = 106 and R 2 = 65. Then 



U = N ] N 2 + - 



10. The corresponding sums of 

(8)(9) 



yl '^ lJ -R t =(8)(10)+^-106=10 
_N i N 2 (N ] +N 2 + l) _(&)(W)m , 7 



Thus a v = 11.25 and 



_ U - fly _ 10 - 40 _ 



Since the hypothesis H that we are testing is whether there is no difference between the alloys, a two-tailed 
test is required. For the 0.05 significance level, we have the decision rule: 

Accept H if -1.96 < z < 1.96. 
Reject H otherwise. 

Because z = -2.67, we reject H and conclude that there is a difference between the alloys at the 0.05 level. 

The MINITAB solution to the problem is as follows: First, the data values for alloy I and alloy II are 
entered into columns CI and C2, respectively, and the columns are named Alloyl and Alloyll. The 
pull-down Stat — ► Nonparametrics — » Mann-Whitney gives the following output. 



Mann-Whitney Test and CI: Alloyl, Alloyll 

N Median 
Alloyl 8 18.600 

Alloyll 10 14.400 
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Point estimate for ETA1-ETA2 is 4.800 

95.4 Percent CI for ETA1-ETA2 is (2.000, 9.401) 

W = 106.0 

Test of ETA1 = ETA2 vs ETA1 not = ETA2 is significant at . 0088 

The MINITAB output gives the median strength of the cables for each sample, a point estimate of the 
difference in population medians (ETA1-ETA2), a confidence interval for the difference in population 
medians, the sum of the ranks for the first named variable (W= 106), and the two-tailed ^-value = 0.0088. 
Since the p-value <0.05, the null hypothesis would be rejected. We would conclude that alloy 1 results in 
stronger cables. 



17.6 Verify results (<5) and (7) of this chapter for the data of Problem 17.5. 
SOLUTION 

(a) Since samples 1 and 2 yield values for U given by 

U X = N,N 2 + N ^± 1 1 - *, = (8)(10) + - 106 = 10 

we have U Y + U 2 = 10 + 70 = 80, and N X N 2 = (8)(10) = 80. 

(b) Since R x = 106 and R 2 = 65, we have R, + R 2 = 106 + 65 = 171 and 

N(N+ 1) _ (JV, + NjjN, +N 2 + 1) _ (18)(19) _ ^, 



17.7 Work Problem 17.5 by using the statistic U for the alloy II sample. 
SOLUTION 

For the alloy II sample, 

U = N l N 2+ N ^ 2 + V-R 2 = m X0) + ^-65 = 70 

sothat z= f ^ = 7 ^° = 2.67 

a v 11.25 

This value of z is the negative of the z in Problem 17.5, and the right-hand tail of the normal distribution is 
used instead of the left-hand tail. Since this value of z also lies outside -1.96 < z < 1.96, the conclusion is the 
same as that for Problem 17.5. 



17.8 A professor has two classes in psychology: a morning class of 9 students, and an afternoon class 
of 12 students. On a final examination scheduled at the same time for all students, the classes 
received the grades shown in Table 17.10. Can one conclude at the 0.05 significance level that the 
morning class performed worse than the afternoon class? Work the problem first by hand, 
supplying all the details for the Mann-Whitney U test. Follow this with the MINITAB solution 
to the problem. 



Table 17.10 



Morning class 


73 87 79 75 82 66 95 75 70 


Afternoon class 


86 81 84 88 90 85 84 92 83 91 53 84 
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Step 1. Table 17.11 shows the array of grades and ranks. Note that the rank for the two grades of 75 is 
i(5 + 6) = 5.5, while the rank for the three grades of 84 is i(ll + 12 + 13) = 12. 



53 66 70 73 75 75 79 81 82 83 84 
.5 7 8 9 10 



84 85 86 87 88 90 91 92 95 
14 15 16 17 18 19 20 21 



Step 2. Rewriting Table 17.10 in terms of ranks gives us Table 17.12. 

Check: R x = 73, R 2 = 158, and N = JV, + N 2 = 9 + 12 = 21; thus R { + R 2 = 73 + 158 = 231 and 





Sum of 
Ranks 


Morning class 


4 16 7 5.5 9 2 21 5.5 3 


73 


Afternoon class 


15 8 12 17 18 14 12 20 10 19 1 12 


158 



U = N l N 2 +- 
_N,N 2 _(9)(12) _ ka 



_ NjN 2 (Ni + N 2 + 1) _ (9) (12) (22) _ 



Thus z = ^^ = ^— — = 1.85 

a v 14.07 

The one-tailed p-value is given by the EXCEL expression =1-N0RMSDIST ( 1 . 85 ) which yields 0.0322. 
Since the /j-value <0.05, we can conclude that the morning class performs worse than the afternoon class. 

The MINITAB solution to the problem is as follows. First the data values for the morning and after- 
noon classes are entered into columns CI and C2 and the columns are named morning and afternoon. 
The pull-down Stat — > Nonparametrics — » Mann-Whitney gives the following output. 



Mann-Whitney Test and CI: Morning, Afternoon 

N Median 
Morning 9 75.00 

Afternoon 12 84.50 

Point estimate for ETA1-ETA2 is -9.00 

95.7 Percent CI for ETA1-ETA2 is (-15.00, 2.00) 

W = 73.0 

Test of ETA1 = ETA2 vs ETA1 < ETA2 is significant at 0.0350 
The test is significant at 0.0348 (adjusted for ties) 

The MINITAB output gives the median grade for each sample, a point estimate of the difference in 
population medians, a confidence interval for the difference in population medians, the sum of ranks for 
the first named variable (in this case, the morning class), and the one-tail p-value = 0.0350. Since the p-value 
is less than 0.05, the null hypothesis would be rejected. We conclude that the morning class performs worse 
than the afternoon class. 
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17.9 Find U for the data of Table 17.13 by using (a) formula (2) of this chapter and (b) the counting 
method (as described in Remark 4 of this chapter). 

SOLUTION 

(a) Arranging the data from both samples in an array in increasing order of magnitude and assigning ranks 
from 1 to 5 gives us Table 17.14. Replacing the data of Table 17.13 with the corresponding ranks gives 
us Table 17.15, from which the sums of the ranks are R t = 5 and R 2 = 10. Since N x =2 and N 2 = 3, 
the value of U for sample 1 is 

U = N l N 2 + N ^ + V -R ] = { 2 ){ 3 )+ ^-5 = 4 

The value of U for sample 2 can be found similarly to be U = 2. 



Sample 1 


22 10 


Sample 2 


17 25 14 



Table 17.15 





Sum of 


Sample 1 


4 1 


5 






10 



(b) Let us replace the sample values in Table 17.14 with I or II, depending on whether the value belongs to 
sample 1 or 2. Then the first line of Table 17.14 becomes 



I II II I 



Number of sample 1 values preceding first sample 2 value = 1 
Number of sample 1 values preceding second sample 2 value = 1 
Number of sample 1 values preceding third sample 2 value = 2 
Total = 4 

Thus the value of U corresponding to the first sample is 4. 
Similarly, we have 

Number of sample 2 values preceding first sample 1 value = 
Number of sample 2 values preceding second sample 1 value = 2 
Total = 2 



Thus the value of U corresponding to the second sample is 2. 

Note that since N x =2 and N 2 = 3, these values satisfy U l + U 2 = N l N 2 ; that is, 4 + 2 = 

(2)(3) = 6. 
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17.10 A population consists of the values 7, 12, and 15. Two samples are drawn without replacement 
from this population: sample 1, consisting of one value, and sample 2, consisting of two values. 
(Between them, the two samples exhaust the population.) 

(a) Find the sampling distribution of U and its graph. 

(b) Find the mean and variance of the distribution in part (a). 

(c) Verify the results found in part (b) by using formulas (3) of this chapter. 

SOLUTION 

(a) We choose sampling without replacement to avoid ties — which would occur if, for example, the value 
12 were to appear in both samples. 

There are 3 • 2 - (> possibilities for choosing the samples, as shown in Table 17.16. It should 
be noted that we could just as easily use ranks 1, 2, and 3 instead of 7, 12, and 15. The value U in 
Table 17.16 is that found for sample 1, but if U for sample 2 were used, the distribution would be 
the same. 



Table 17.16 



Sample 1 


Sample 2 


U 


7 


12 15 


2 


7 


15 12 


2 


12 


7 15 


1 


12 


15 7 


1 


15 


7 12 





15 


12 7 






A graph of this distribution is shown in Fig. 17-4, where / is the frequency. The probability 
distribution of U can also be graphed; in this case Pr{[7 = 0} = Pr{U = 1} = Pr{U = 2} = i. The 
required graph is the same as that shown in Fig. 17-4, but with ordinates 1 and 2 replaced by \ and 
5, respectively. 



2.50 



2.25 



*- 2.00 - • • • 



1.75 



1.50- 

0.0 0.5 1.0 1.5 2.0 

Fig. 17-4 MINITAB plot of the sampling distribution for U and JV, = 1 and N 2 = 2. 
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(b) The mean and 



found from Table 17.16 are given by 
2 + 2 + 1 + 1+0+0 



= 1 



2 = (2 - 1)- + (2 - I) 1 + (1 - l) 2 + (1 - I) 1 + (0 - l) 2 + (0 - If _ 2 
v 6 3 

(c) By formulas (3), 



iVW 2 _ (1)(2) 
2 2 



= 1 



2 N X N 2 {N X + N 2 + 1) (1)(2)(1 +2+ 1) 2 

^ = 12 = 12 = 3 

showing agreement with part (a). 

17.11 (a) Find the sampling distribution of U in Problem 17.9 and graph it. 

(b) Graph the corresponding probability distribution of U. 

(c) Obtain the mean and variance of U directly from the results of part (a). 

(d) Verify part (c) by using formulas (3) of this chapter. 

SOLUTION 

(a) In this case there are 5 ■ 4 ■ 3 ■ 2 = 120 possibilities for choosing values for the two samples and the 
method of Problem 17.9 is too laborious. To simplify the procedure, let us concentrate on the smaller 
sample (of size JV^ = 2) and the possible sums of the ranks, J?] . The sum of the ranks for sample 1 is the 
smallest when the sample consists of the two lowest-ranking numbers (1,2); then /?, = 1 + 2 = 3. 
Similarly, the sum of the ranks for sample 1 is the largest when the sample consists of the two 
highest-ranking numbers (4,5); then R { =4+5 = 9. Thus R u varies from 3 to 9. 

Column 1 of Table 17.17 lists these values of Ri (from 3 to 9), and column 2 shows the corre- 
sponding sample 1 values, whose sum is Rj. Column 3 gives the frequency (or number) of samples with 
sum Ri; for example, there are / = 2 samples with Ri = 5. Since N\ = 2 and N 2 = 3, we have 



U = N,N 2 + 



Ni{Nj + 1) 



J2)(3) 



From this we find the corresponding values of U in column 4 of the table; note that as R { varies from 3 
to 9, U varies from 6 to 0. The sampling distribution is provided by columns 3 and 4, and the graph is 
shown in Fig. 17-5. 



Fig. 17-5 EXCEL plot of the sampling distribution for U and iV, = 2 and N 2 = 3. 
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(b) The probability that U = R A (i.e., Pr{{7 = R x }) is shown in column 5 of Table 17.17 and is obtained by 
finding the relative frequency. The relative frequency is found by dividing each frequency/by the sum 
of all the frequencies, or 10; for example, Pr{[7 = 5} = = 0.2. The graph of the probability distribu- 
tion is shown in Fig. 17-6. 





Sample 1 Values 


/ 


U 


Pr{t/ = J?,} 


3 


(1,2) 


1 


6 


0.1 


4 


(1,3) 


1 


5 


0.1 


5 


(1, 4), (2, 3) 




4 


0.2 


6 


(1, 5), (2, 4) 




3 


0.2 


7 


(2, 5), (3, 4) 




2 


0.2 


8 


(3, 5) 


1 


1 


0.1 


9 


(4, 5) 


1 





0.1 



0.00 1.00 2.00 3.00 4.00 5.00 6.00 
U 

Fig. 17-6 SPSS plot of the probability distribution for U with N x =2 and N 2 = 3. 
(c) From columns 3 and 4 of Table 17.17 we have 

u = u = L W = CX 6 ) + W( 5 ) + ( 2 )W + ( 2 )( 3 ) + ( 2 X 2 ) + 0)0) + WW = 3 

' ° E / 1+1+2+2+2+1+1 



E/(t/-E7f 



_ (l)(6-3)^ + (l)(5-3) 2 + (2)(4-3) / + (2)(3-3) 2 + (2)(2-3) / + (l)(l-3) 2 +(l)(0-3)^ 



Another method 



462 



NONPARAMETRIC TESTS 



[CHAP. 17 



(d) By formulas (3), using N x =2 and N 2 = 3, we have 

_ _ (2X3) 2 _ jV 1 iV 2 (jV 1 +iV 2 + l) _ (2)(3)(6) _ 



17.12 If TV numbers in a set are ranked from 1 to N, prove that the sum of the ranks is [N(N + l)]/2. 
SOLUTION 

Let be the sum of the ranks. Then we have 

R=l + 2 + 3 + --- + (N-l)+N (16) 

R = N+(N-\) + (N -!) + ■■■+ 2 +1 (17) 

where the sum in equation (17) is obtained by writing the sum in (16) backward. Adding equations (16) and 
(17) gives 

2R = (N + 1) + (TV + 1) + (N + 1) + • • • + (N + 1) + (TV + 1) = N(N + 1) 

since (N + 1) occurs AT times in the sum; thus ^ = [iV(JV + l)]/2. This can also be obtained by using a result 
from elementary algebra on arithmetic progressions and series. 

17.13 If i?i and R 2 are the respective sums of the ranks for samples 1 and 2 in the U test, prove that 
Ri +R 2 - [N(N+ l)]/2. 

SOLUTION 

We assume that there are no ties in the sample data. Then R t must be the sum of some of the ranks 
(numbers) in the set 1, 2, 3, . . . , N, while R 2 must be the sum of the remaining ranks in the set. Thus the sum 

R t + R 2 must be the sum of all the ranks in the set; that is, R, + R 2 = 1 + 2 + 3 H h N = [N(N + l)]/2 

by Problem 17.12. 



THE KRUSKAL-WALLIS H TEST 

17.14 A company wishes to purchase one of five different machines: A, B, C, D, or E. In an experiment 
designed to determine whether there is a performance difference between the machines, five 
experienced operators each work on the machines for equal times. Table 17.18 shows the number 
of units produced by each machine. Test the hypothesis that there is no difference between the 
machines at the (a) 0.05 and (b) 0.01 significance levels. Work the problem first by hand, supplying 
all the details for the Kruskal-Wallis H test. Follow this with the MINITAB solution to the 
problem. 



12 6.5 2.5 



16 19 17.5 6.5 
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Since there are five samples (A, B, C, D, and E ), k = 5. And since each sample consists of five values, we 
have Ni = N 2 = N 3 = N 4 = N 5 = 5, and N = + N 2 + N 3 + N 4 + N 5 = 25. By arranging all the values in 
increasing order of magnitude and assigning appropriate ranks to the ties, we replace Table 17.18 with Table 



17.19, the right-hand column of which shows the 
R 2 = 48.5, R 3 = 93, R 4 = 40.5, and R 5 = 73. Thus 



i of the ranks. We see from Table 17.19 that R 1 = 70, 



12 f(70) 2 (48.5) 2 (93) 2 (40) 2 (73) 2 1 
" (25)(26) [ 5 + 5 + 5 + 5 + 5 \ i[2b) ~ & ' 44 

For k — 1 = 4 degrees of freedom at the 0.05 significance level, from Appendix IV we have \^ = 9.49. Since 
6.44 < 9.49, we cannot reject the hypothesis of no difference between the machines at the 0.05 level and 
therefore certainly cannot reject it at the 0.01 level. In other words, we can accept the hypothesis (or reserve 
judgment) that there is no difference between the machines at both levels. 

Note that we have already worked this problem by using analysis of variance (see Problem 16.8) and 
have arrived at the same conclusion. 

The solution to the problem by using MINITAB proceeds as follows. First the data need to be entered 
in the worksheet in stacked form. The data structure is as follows: 



Row 



Machii 



Unii 



The pull-down Stat — ► Nonparametrics — ► Kruskal-Wallis gives the following output: 
Kruskal-Wallis Test: Units versus Machine 



68.00 
53.00 
72.00 
57.00 
65.00 
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H = 6.44 DF = 4 P = 0.168 

H = 6.49 DF = 4 P = . 165 ( adjusted for ties ) 

Note that two p-values are provided. One is adjusted for ties that occur in the ranking procedure and 
one is not. It is clear that the machines are not statistically different with respect to the number of units 
produced at either the 0.05 or the 0.01 levels since either p- value far exceeds both levels. 



17.15 Work Problem 17.14 using the software package STAT1STIX and the software package SPSS. 
SOLUTION 

The data file is structured as in problem 17.14. The STATISTIX pull-down Statistics One, Two, 
Multi-sample tests — ► Kruskal-Wallis One-way AOV Test gives the following output. Compare this with the 
output in Problem 17.14. 



Kruskal-Wallis One-Way Nonparametric AOV for Units by Machine 





Mean 


Sample 


Machine 


Rank 


Size 


1 


14.0 


5 


2 


9.7 


5 


3 


18.6 


5 


4 


8.1 


5 


5 


14.6 


5 


Total 


13.0 


25 


Kruskal-Wallis 


Statistic 





p-Value, Using Chi-Squared Approximate 



6.4949 
. 1651 



The STATISTIX output compares with the MINITAB output adjusted for ties. 
The SPSS output also compares with the MINITAB output adjusted for ties. 



Kruskal-Wallis Test 



1.00 
2.00 
3.00 



14.00 
9.70 
18.60 



5.00 
Total 



Test Statistics 3 " 





Units 


Chi-Square 
df 

Asymp. Sig. 


6.495 
4 

.165 



a Kruskal Wallis Test 
b Grouping Variable: Machine 
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17.16 Table 1 7.20 gives the number of DVD's rented during the past year for random samples of teachers, 
lawyers, and physicians. Use the Kruskal-Wallis H test procedure in SAS to test the null hypothesis 
that the distributions of rentals are the same for the three professions. Test at the 0.01 level. 



Table 17.20 



Teachers 


Lawyers 


Physicians 


18 


2 


14 


4 


16 


30 


5 


21 


11 


9 


24 


1 


20 


5 


7 


26 


2 


5 


7 


50 


14 


17 


10 


7 


43 


7 


16 


20 


49 


14 


24 


35 


27 


7 


1 


19 


34 


45 


15 


30 


6 


22 


45 


9 


20 


2 


24 


10 


45 


36 




9 


50 






44 






3 





The data are entered in two columns. One column contains 1, 2, or 3 for teacher, lawyer, and physician. 
The other column contains the number of DVD's rented. The SAS pull-down Statistics — ► ANOVA — > 
Nonparametric Oneway ANOVA gives the following output. 

The NPAR1WAY Procedure 

Wilcoxon Scores (Rank Sums) for Variable number 
Classified by Variable Profession 

Sum of Expected Std Dev Mean 

Profession N Scores Under HO Under HO Score 

ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff 

1 18 520.50 495.0 54.442630 29.916667 

2 20 574.50 550.0 55.770695 28.725000 

3 16 390.00 440.0 52.735539 24.375000 

Average scores were used for ties. 

Kruskal-Wallis Test 

Chi-Square 0.9004 
DF 2 
Pr > Chi-Sqaure 0.6375 
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The /rvalue is indicated as Pr > Chi-Square . 6375. Since the p-value is much greater than 0.05, do 
not reject the null hypothesis. 



THE RUNS TEST FOR RANDOMNESS 

17.17 In 30 tosses of a coin the following sequence of heads (H) and tails (T) is obtained: 
HTTHTHHHTHHTTHT 
HTHHTHTTHTHHTHT 

(a) Determine the number of runs, V. 

(b) Test at the 0.05 significance level whether the sequence is random. 

Work the problem first by hand, supplying all the details of the runs test for randomness. Follow 
this with the MINITAB solution to the problem. 

SOLUTION 

(a) Using a vertical bar to indicate a run, we see from 

H | T T | H | T | H H H|T|H H | T T|H|T| 
H | T | H H | T | H | T T|H|T|H H|T|H|T| 
that the number of runs is V = 22. 

(b) There are jVj = 16 heads and N 2 = 14 tails in the given sample of tosses, and from part (a) the number 
of runs is V = 22. Thus from formulas (13) of this chapter we have 

fiy = ?P^ + 1 = 15.93 a\ = 2(16)(14)[2(16)(14)-16-14] = 

16+14 (16+ 14) 2 (16+ 14- 1) 

or a v = 2.679. The z score corresponding to V = 22 runs is therefore 
F — _ 22 — 15.93 _ 27 
a v 2.679 

Now for a two-tailed test at the 0.05 significance level, we would accept the hypothesis H of random- 
ness if — 1.96 < z < 1.96 and would reject it otherwise (see Fig. 17-7). Since the calculated value of z is 
2.27 > 1.96, we conclude that the tosses are not random at the 0.05 level. The test shows that there are 
too many runs, indicating a cyclic pattern in the tosses. 
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If the correction for continuity is used, the above z score is replaced by 
(22-0.5)- 15.93 

Z = 1679 = 2 -° 8 

and the same conclusion is reached. 

The solution to the problem by using MINITAB proceeds as follows. The data are entered into 
column CI as follows. Each head is represented by the number 1 and each tail is represented by the 
number 0. Column 1 is named Coin. The MINITAB pull-down Stat — ► Nonparametrics — ► Runs Test 
gives the following output. 

Runs Test: Coin 

Runs test for Coin 

Runs above and below K = 0. 533333 

The observed number of runs = 22 
The expected number of runs = 15.9333 
16 Observations above K 14 below 
p-value = 0.024 

The value shown for K is the mean of the zeros and ones in column 1 . The number of observations 
above and below K will be the number of heads and tails in the 30 tosses of the coin. The p-value is 
equal to 0.0235. Since this p-value is less than 0.05, we reject the null hypothesis. The number of runs is 
not random. 



17.18 A sample of 48 tools produced by a machine shows the following sequence of good (G) and 
defective (D) tools: 

GGGGGGDDGGGGGGGG 
GGDDDDGGGGGGDGGG 
GGGGGGDDGGGGGDGG 

Test the randomness of the sequence at the 0.05 significance level. Also use SPSS to test the 
randomness of the sequence. 



The numbers of D's and G's are TV) = 10 and N 2 = 38, respectively, and the number of runs is V = 11. 
Thus the mean and variance are given by 

^ = ^#+1 = 16.83 4 ^(10)(38)[2(10)(38)- 10- 38]^ 997 
10 + 38 (10 + 38) 2 (10 + 38 - 1) 

so that a v = 2.235. 

For a two-tailed test at the 0.05 level, we would accept the hypothesis H of randomness 
if -1.96<z< 1.96 (see Fig. 17-7) and would reject it otherwise. Since the z score corresponding to 
V = 11 is 

V~W H-16.83 
Z = ^ K - = ^235- = " 2 - 61 

and -2.61 < -1.96, we can reject H at the 0.05 level. 

The test shows that there are too few runs, indicating a clustering (or bunching) of defective tools. In 
other words, there seems to be a trend pattern in the production of defective tools. Further examination of 
the production process is warranted. 
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The pull-down Analyze — > Nonparametric Tests — > Runs gives the following SPSS output. If the G's are 

replaced by l's and the D's are replaced by O's, the test value, 0.7917, is the mean of these values. The other 
values are iVi = 10, JV 2 = 38, Sum = 48, and V=ll. The computed value of z has the correction for 
continuity in it. This causes the z value to differ from the z value computed above. The term 
Asymp. Sig. (2-tailed) is the two-tailed /--value corresponding to z = -2.386. We see that the SPSS 
output has the same information as we computed by hand. We simply have to know how to interpret it. 



Runs Test 





Quality 


Test Value 3 


.7917 


Cases < Test Value 


10 


Cases > = Test Value 


38 


Total Cases 


48 


Number of Runs 


11 


Z 


-2.386 


Asymp. Sig. (2-tailed) 


.017 



a Mean 



17.19 (a) Form all possible sequences consisting of three a's and two i's and give the numbers of runs, 
V, corresponding to each sequence. 

(b) Obtain the sampling distribution of V and its graph. 

(c) Obtain the probability distribution of V and its graph. 

SOLUTION 

(a) The number of possible sequences consisting of three a's and two b's is 

These sequences are shown in Table 17.21, along with the number of runs corresponding to each 
sequence. 

(b) The sampling distribution of Fis given in Table 17.22 (obtained from Table 17.21), where K denotes the 
number of runs and / denotes the frequency. For example, Table 17.22 shows that there is one 5, 
four 4's, etc. The corresponding graph is shown in Fig. 17-8. 



Sequence 


Runs (V) 






b 


b 


2 




a b 




b 


4 




a b 


b 




3 




b a 


b 




5 




b b 






3 




b a 




b 


4 


b 


b a 








b 


a b 






4 


b 


a a 




b 


3 


b 


a a 


b 




4 



V 


/ 






3 
4 

5 


3 
4 
1 
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(c) The probability distribution of V, graphed in Fig. 17-9, is obtained from Table 17.22 by dividing each 
frequency by the total frequency 2 + 3 + 4+1 = 10. For example, Pr{ V = 5} = ^ = 0.1. 



17.20 Find (a) the mean and (b) the variance of the number of runs in Problem 17.19 directly from the 
results obtained there. 

Scatter Plot of fvs V 



Fig. 17-8 STATIST1X graph of the sampling distribution of V. 



Fig. 17-9 EXCEL graph of the p 



>ility distribution of V. 



SOLUTION 

(a) From Table 17.21 we have 



2+4+3+5+3+4+2+4+3+4 17 



Another method 

From Table 17.21 the grouped-data method gives 

>:/' (2)(2) + (3)(3) + (4)(4) + (1)(5) = 17 
^ £/ 2+3+4+1 5 
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(b) Using the grouped-data method for computing the variance, from Table 17.22 we have 

H^)H ? -t)^-t)'+<'>K)1 

Another method 

As in Chapter 3, the variance is given by 



a 2 = V2 p2 _ (2)(2) 2 + (3)(3) 2 + (4)(4) 2 + (1)(5)' 



17.21 Work Problem 17.20 by using formulas (73) of this chapter. 
SOLUTION 

Since there are three a's and two b's, we have Aj = 3 and A 2 = 2. Thus 
2N l N 2 , 2(3)(2) , 17 
Mf = + = 3 + 2 + T 

2 _ 2A 1 A 2 (2A 1 A 2 — A] — N 2 ) _ 2(3)(2)[2(3)(2) - 3 - 2] _ 21 
aV ~ (Af 1 +A 2 ) 2 (A 1 +A 2 -1) ~ (3 + 2) 2 (3 + 2-l) ~ 25 



FURTHER APPLICATIONS OF THE RUNS TEST 

17.22 Referring to Problem 17.3, and assuming a significance level of 0.05, determine whether the 
sample lifetimes of the batteries produced by the PQR Company are random. Assume the life- 
times of the batteries given in Table 17.3 were recorded in a row by row fashion. That is, the first 
lifetime was 217, the second lifetime was 230, and so forth until the last lifetime, 268. Work the 
problem first by hand, supplying all the details of the runs test for randomness. Follow this with 
the STATISTIX solution to the problem. 

SOLUTION 

Table 17.23 shows the batteries' lifetimes in increasing order of magnitude. Since there are 24 entries in 
the table, the median is obtained from the middle two entries, 253 and 262, as \ (253 + 262) = 257.5. 
Rewriting the data of Table 17.3 by using an a if the entry is above the median and a b if it is below the 
median, we obtain Table 17.24, in which we have 12 a's, 12 b's, and 15 runs. Thus A, = 12, A 2 = 12, A = 24, 
V = 15, and we have 

,^#^ + l^ 2 ^ ) + 1^13 4 ^2X12X264) ^ 
A,+AT 2 12+ 12 (24) 2 (23) 



Table 17.23 Table 17.24 

198 211 216 219 224 225 230 236 abbaabab 

243 252 253 253 262 264 268 271 bbaababb 

272 275 282 284 288 291 294 295 aabbabaa 
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Using a two-tailed test at the 0.05 significance level, we would accept the hypothesis of randomness 
if — 1.96 < z < 1.96. Since 0.835 falls within this range, we conclude that the sample is random. 

The STATISTIX analysis proceeds as follows. The lifetimes are entered into column 1 in the order in 
which they were collected. The column is named Lifetime. The pull-down Statistics — » Randomness/ 
Normality Tests — > Runs Test gives the following output. 

Statistix 8.0 



Runs Test for Lifetimes 



Median 


257.50 




Values Above the Median 


12 




Values below the Median 


12 




Values Tied with the Median 







Runs Above the Median 


8 




Runs Below the Median 


7 




Total Number of Runs 


15 




Expected Number of Runs 


13.0 




p-Value, Two-Tailed Test 




0.5264 


Probability of getting 15 or fc 




0.8496 


Probability of getting 14 or mi 




0.2632 



The large />-value indicates that the number of runs may be regarded as being random. 



17.23 Work Problem 17.5 by using the runs test for randomness. 
SOLUTION 

The arrangement of all values from both samples already appears in line 1 of Table 17.8. Using the 
symbols a and b for the data from samples I and II, respectively, the arrangement becomes 

bbbbbbbbaaaaabbaaa 

Since there are four runs, we have V = 4, TAT, = 8, and N 2 = 10. Then 

a2 _ 2N X N 2 (2N,N 2 - JV, - N 2 ) 2(8) ( 10) ( 142) _ ^ ^ 
V {Ni+N 2 ) 2 (Ni+N 2 -l) (18) 2 (17) 



a v 2.031 

If H is the hypothesis that there is no difference between the alloys, it is also the hypothesis that 
the above sequence is random. We would accept this hypothesis if — 1 .96 < z < 1 .96 and would reject 
it otherwise. Since z = —2.90 lies outside this interval, we reject H and reach the same conclusion as for 
Problem 17.5. 

Note that if a correction is made for continuity, 

F- MF _ (4 + 0.5)- 9.889 _ 
Z ~ a v ~ 2.031 ~ ^ 



and we reach the same conclusion. 
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RANK CORRELATION 



17.24 Table 17.25 shows how 10 students, arranged in alphabetical order, were ranked according to 
their achievements in both the laboratory and lecture sections of a biology course. Find the 
coefficient of rank correlation. Use SPSS to compute Spearman's rank correlation. 



Table 17.25 



Laboratory 


8 3 9 2 7 10 4 6 1 5 


Lecture 


9 5 10 1 8 7 3 4 2 6 



The difference in ranks, D, in the laboratory and lecture sections for each student is given in Table 
17.26, which also gives D 2 and J2 D 2 - Thus 

— -1^5-'-^-^ 

indicating that there is a marked relationship between the achievements in the course's laboratory and 
lecture sections. 



Table 17.26 



Difference of ranks (D) 


-1 -2 -1 


-1 3 


2 -1 -1 




D 2 


1 4 1 


1 9 


4 1 1 


£ D 2 = 24 | 



Enter the data from Table 17.26 in columns named Lab and Lecture. The pull-down Analyze — > 
Correlate — > Bivariate results in the following output. 



Correlations 







Lab 


Lecture 


Spearman's rho 


Lab Correlation Coefficient 


1.000 


.855 




Sig. (2-tailed) 




.002 




N 


10 


10 




Lecture Correlation Coefficient 


.855 


1.000 




Sig. (2-tailed) 


.002 






N 


10 


10 



The output indicates a significant correlation between performance in Lecture and Lab. 



17.25 Table 17.27 shows the heights of a sample of 12 fathers and their oldest adult sons. 
Find the coefficient of rank correlation. Work the problem first by hand, supplying all the 
details in finding the coefficient of rank correlation. Follow this with the SAS solution to the 
problem. 
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Table 17.27 



Height of father (inches) 


65 63 67 64 68 62 70 66 68 67 69 71 


Height of son (inches) 


68 66 68 65 69 66 68 65 71 67 68 70 



SOLUTION 

Arranged in ascending order of magnitude, the fathers' heights are 

62 63 64 65 66 67 67 68 68 69 71 (18) 

Since the sixth and seventh places in this array represent the same height [67 inches (in)], we assign a mean 
rank \ (6 + 7) = 6.5 to these places. Similarly, the eighth and ninth places are assigned the rank 
j(8 + 9) = 8.5. Thus the fathers' heights are assigned the ranks 

1 2 3 4 5 6.5 6.5 8.5 8.5 10 11 12 (19) 

Similarly, arranged in ascending order of magnitude, the sons' heights are 

65 65 66 66 67 68 68 68 68 69 70 71 (20) 

and since the sixth, seventh, eighth, and ninth places represent the same height (68 in), we assign the mean 
rank ± (6 + 7 + 8 + 9) = 7.5 to these places. Thus the sons' heights are assigned the ranks 

1.5 1.5 3.5 3.5 5 7.5 7.5 7.5 7.5 10 11 12 (21) 

Using the correspondences (18) and (19), and (20) and (21), we can replace Table 17.27 with 
Table 17.28. Table 17.29 shows the difference in ranks, D, and the computations of D 2 and J2 1)2 > whereby 







. , 6J2D 2 , 6(72.50) nnACK 
ls=l N(N 2 -Yr X 12(12^- 1} = 0.7465 










Table 17.28 








Rank of father 


4 2 6.5 3 8.5 1 11 5 8.5 6.5 


10 12 






Rank of son 


7.5 3.5 7.5 1.5 10 3.5 7.5 1.5 12 5 


7.5 11 








Table 17.29 






D 


-3.5 -1.5 - 


-1.0 1.5 -1.5 -2.5 3.5 3.5 -3.5 1.5 2.5 1.0 






D 2 


12.25 2.25 


.00 2.25 2.25 6.25 12.25 12.25 12.25 2.25 6.25 1.00 


E7> 2 = 


72.50 


This result agrees well with the correlation coefficient obtained by other methods. 



Enter the fathers' heights 111 the iirst column and the sons" heights in the second column and give the 
SAS pull-down command Statistics — ► Descriptive — ► Correlations. The following output is obtained. 

The CORK Procedure 
2 Variables: Fatherht Sonht 

Simple Statistics 



Variable N Mean Std Dev Median Minimum Maximum 

Fatherht 12 66.66667 2.77434 67.00000 62.00000 71.00000 

Sonht 12 67.58333 1.88092 68.00000 65.00000 71.00000 
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Spearman Correlation Coefficient, N = 12 
prob > | r | under HO : Rho = 

Fatherht Sonht 
Fatherht 1.00000 0.74026 
0.0059 

Sonht 0.74026 1.00000 

0.0059 

In addition to the simple statistics, SAS gives the Spearman correlation coefficient as 0.74. The Rvalue, 
0.0059, can be used to test the null hypothesis that the population coefficient of rank correlation is equal to 
versus the alternative that the population coefficient of rank correlation is different from 0. We conclude that 
there is a relationship between the heights of fathers and sons in the population. 



Supplementary Problems 

THE SIGN TEST 

17.26 A company claims that if its product is added to an automobile's gasoline tank, the mileage per gallon will 
improve. To test the claim, 15 different automobiles are chosen and the mileage per gallon with and without 
the additive is measured; the results are shown in Table 17.30. Assuming that the driving conditions are 
the same, determine whether there is a difference due to the additive at significance levels of (a) 0.05 and 
(b) 0.01. 



34.7 28.3 19.6 25.1 



15.7 24.5 28.7 23.5 27.7 



29.6 22.4 25.7 28.1 24.3 



1.4 27.2 20.4 24.6 14.9 22.3 26.8 24.1 26.2 31.4 28.8 23.1 24.0 27.3 22.9 



17.27 Can one conclude at the 0.05 significance level that the mileage per gallon achieved in Problem 17.26 is better 
with the additive than without it? 



17.28 A weight4oss club advertises that a special program that it has designed will produce a weight loss of at least 
6% in 1 month if followed precisely. To test the club's claim, 36 adults undertake the program. Of these, 
25 realize the desired loss, 6 gain weight, and the rest remain essentially unchanged. Determine at the 
0.05 significance level whether the program is effective. 



17.29 \ training manager claims ihal b\ giving a special course lo company sales personnel, (he company's annual 
sales will increase. To test this claim, the course is given to 24 people. Of these 24, the sales of 16 increase, 
those of 6 decrease, and those of 2 remain unchanged. Test at the 0.05 significance level the hypothesis that 
the course increased the company's sales. 



17.30 The MW Soda Company sets up "taste tests" in 27 locations around the country in order to determine the 
public's relative preference for two brands of cola, A and B. In eight locations brand A is preferred over 
brand B, in 17 locations brand B is preferred over brand A, and in the remaining locations there is 
indifference. Can one conclude at the 0.05 significance level that brand B is preferred over brand A1 



17.31 The breaking strengths of a random sample of 25 ropes made by a manufacturer are given in Table 1 7.3 1 . On 
the basis of this sample, test at the 0.05 significance level the manufacturer's claim that the breaking strength 
of a rope is (a) 25, (b) 30, (c) 35, and (d) 40. 
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17.32 Show how to obtain 95% confidence limits for the 




17.33 Make up and solve a problem involving the sign test. 



THE MANN-WHITNEY U TEST 



17.34 Instructors A and B both teach a first course in chemistry at XYZ University. On a common final examina- 
tion, their students received the grades shown in Table 17.32. Test at the 0.05 significance level the hypothesis 
that there is no difference between the two instructors' grades. 



Table 17.32 

A 88 75 92 71 63 84 55 64 82 96 

B 12 65 84 53 76 80 51 60 57 85 94 87 73 61 



17.35 Referring to Problem 17.34, can one conclude at the 0.01 significance level that the students' grades in the 
morning class are worse than those in the afternoon class? 

17.36 A farmer wishes to determine whether there is a difference in yields between two different varieties of wheat, 
I and II. Table 17.33 shows the production of wheat per unit area using the two varieties. Can the farmer 
conclude at significance levels of (a) 0.05 and (b) 0.01 that a difference exists? 



15.9 15.3 16.4 14.9 15.3 16.0 14.6 15.3 14.5 16.6 



15.6 18.1 17.2 



17.37 Can the farmer of Problem 17.36 conclude at the 0.05 level that wheat II produces a larger yield than wheat I? 



17.38 A company wishes to determine whether there is a difference between two brands of gasoline, A and B. Table 
17.34 shows the distances traveled per gallon for each brand. Can we conclude at the 0.05 significance level 
(a) that there is a difference between the brands and (b) that brand B is better than brand A1 



A 30.4 28.7 29.2 32.5 31.7 29.5 30.8 31.1 30.7 31.8 



17.39 Can the U test be used to determine whether there is a difference between machines I and II of Table 17.1? 
Explain. 
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17.40 Make up and solve a problem using the U test. 

17.41 Find U for the data of Table 17.35, using (a) the formula method and (b) the counting method. 

17.42 Work Problem 17.41 for the data of Table 17.36. 



Table 17.35 



Table 17.36 



Sample 1 


15 25 


Sample 2 


20 32 



Sample 1 


40 27 30 56 


Sample 2 


10 35 



17.43 \ population consists of the values 2, 5, 9, and 12. Two samples are drawn from this population, the first 
consisting of one of these values and the second consisting of the other three values. 

(a) Obtain the sampling distribution of U and its graph. 

(b) Obtain the mean and variance of this distribution, both directly and by formula. 

17.44 Prove that U ] + U 2 = NiN 2 . 

17.45 Prove that + R 2 = [N(N + l)]/2 for the case where the number of ties is (a) 1, (b) 2, and (c) any number. 

17.46 If Ni = 14, N 2 = 12, and R t = 105, find (a) R 2 , (b) U u and (c) U 2 . 

17.47 If N t = 10, N 2 = 16, and U 2 = 60, find (a) R u (b) R 2 , and (c) U t . 

17.48 What is the largest number of the values N u N 2 , R { , R 2 , U\, and U 2 that can be determined from the 
remaining ones? Prove your answer. 



THE KRUSKAL-WALLIS H TEST 

17.49 An experiment is performed to determine the yields of five different varieties of wheat: A, B, C, D, and 
E. Four plots of land are assigned to each variety. The yields (in bushels per acre) are shown in Table 
17.37. Assuming that the plots have similar fertility and that the varieties are assigned to the plots at 
random, determine whether there is a significant difference between the yields at the (a) 0.05 and (b) 0.01 levels. 



16 18 14 



17 20 12 



21 14 17 



32 40 42 38 30 34 
31 37 35 33 34 30 



33 32 29 31 



17.50 A company wishes to test four different types of tires: A, B, C, and D. The lifetimes of the tires, as 
determined from their treads, are given (in thousands of miles) in Table 17.38; each type has been tried 
on six similar automobiles assigned to the tires at random. Determine whether there is a significant difference 
between the tires at the (a) 0.05 and (b) 0.01 levels. 
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17.51 A teacher wishes to test three different teaching methods: I, II, and III. To do this, the teacher chooses at 
random three groups of five students each and teaches each group by a different method. The same exam- 
ination is then given to all the students, and the grades in Table 17.39 are obtained. Determine at the (a) 0.05 
and (b) 0.01 significance levels whether there is a difference between the teaching methods. 



78 62 71 58 73 



76 85 77 90 87 



74 79 60 75 80 



17.52 During one semester a student received in various subjects the grades shown in Table 17.40. Test at the 
(a) 0.05 and (b) 0.01 significance levels whether there is a difference between the grades in these subjects. 



Table 17.40 



Mathematics 


72 80 83 75 


Science 


81 74 77 


English 


88 82 90 87 80 


Economics 


74 71 77 70 



17.53 Using the H test, work (a) Problem 16.9, (b) Problem 16.21, and (c) Problem 16.22. 

17.54 Using the H test, work (a) Problem 16.23, (b) Problem 16.24, and (c) Problem 16.25. 



THE RUNS TEST FOR RANDOMNESS 

17.55 Determine the number of runs, V, for each of these sequences: 
(a)ABABBAAABBAB 

(/OHHTHHHTTTTHHTHHTHT 

17.56 Twenty-five individuals were sampled as to whether they liked or did not like a product (indicated by Y and 
N, respectively). The resulting sample is shown by the following sequence: 

YYNNNNYYYNYNNYNNNNNYYYYNN 

(a) Determine the number of runs, V. 

(b) Test at the 0.05 signilicance level whether the responses are random. 

17.57 Use the runs test on sequences (10) and (IT) in this chapter, and state any conclusions about randomness. 

17.58 (a) Form all possible sequences consisting of two a's and one b, and give the number of runs, V, corre- 

sponding to each sequence. 

(b) Obtain the sampling distribution of V and its graph. 

(c) Obtain the probability distribution of V and its graph. 
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17.59 In Problem 17.58, find the mean and variance of V (a) directly from the sampling distribution and (b) by 
formula. 

17.60 Work Problems 17.58 and 17.59 for the cases in which there are (a) two a's and two b's. (b) one a and three 
b's, and (c) one a and four b's. 

17.61 Work Problems 17.58 and 17.59 for the case in which there are two a's and four b's. 



FURTHER APPLICATIONS OF THE RUNS TEST 

17.62 Assuming a significance level of 0.05, determine whether the sample of 40 grades in Table 17.5 is random. 

17.63 The closing prices of a stock on 25 successive days are given in Table 17.41. Determine at the 0.05 
significance level whether the prices are random. 



11.125 
11.250 
10.750 
11.375 
12.125 



10.875 
11.375 
11.500 
11.875 
11.750 



10.625 
10.750 
11.250 
11.125 
11.500 



11.500 
11.000 
12.125 
11.750 
12.250 



17.64 The first digits of V2 are 1.41421 35623 73095 0488- ••. What conclusions can you draw concerning the 
randomness of the digits? 

17.65 What conclusions can you draw concerning the randomness of the following digits? 

(a) V3 = l .73205 08075 68877 2935 •■ ■ 

(b) 7i = 3.14159 26535 89793 2643- •• 

17.66 In Problem 17.62, show that the p-value using the normal approximation is 0.105. 

17.67 In Problem 17.63, show that the p-value using the normal approximation is 0.168. 

17.68 In Problem 17.64, show that the p-value using the normal approximation is 0.485. 



RANK CORRELATION 

17.69 In a contest, two judges were asked to rank eight candidates (numbered 1 through 8) in order of preference. 
The judges submitted the choices shown in Table 17.42. 

(a) Find the coefficient of rank correlation. 

(b) Decide how well the judges agreed in their choices. 



Table 17.42 



First judge 


5 2 8 1 4 6 3 7 


Second judge 


4 5 7 3 2 8 1 6 
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17.70 Table 14.17 is reproduced below and gives the United States consumer price indexes for food and medical- 
care costs during the years 2000 through 2006 compared with prices in the base years, 1982 through 1984 
(mean taken as 100). 



Year 2000 2001 2002 2003 2004 2005 2006 

Food 167.8 173.1 176.2 180.0 186.2 190.7 195.2 

Medical 260.8 272.8 285.6 297.1 310.1 323.2 336.2 
Source: Bureau of Labor Statistics. 



Find the Spearman rank correlation for the data and the Pearson correlation coefficient. 

17.71 The rank correlation coefficient is derived by using the ranked data in the product-moment formula of 
Chapter 14. Illustrate this by using both methods to work a problem. 

17.72 Can the rank correlation coefficient be found for grouped data? Explain this, and illustrate your answer with 
an example. 



Statistical Process 
Control and Process 
Capability 



GENERAL DISCUSSION OF CONTROL CHARTS 

Variation in any process is due to common causes or special causes. The natural variation that exists 
in materials, machinery, and people gives rise to common causes of variation. In industrial settings, 
special causes, also known as assignable causes, are due to excessive tool wear, a new operator, a change 
of materials, a new supplier, etc. One of the purposes of control charts is to locate and, if possible, 
eliminate special causes of variation. The general structure of a control chart consists of control limits 
and a centerline as shown in Fig. 18-1. There are two control limits, called an upper control limit or UCL 
and a lower control limit or LCL. 



Common causes 



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
Time 

Fig. 18-1 Control charts are of two types (variable and attribute control charts). 
480 
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When a point on the control chart falls outside the control limits, the process is said to be out of 
statistical control. There are other anomalous patterns besides a point outside the control limits that also 
indicate a process that is out of control. These will be discussed later. It is desirable for a process to be in 
control so that its' behavior is predictable. 



VARIABLES AND ATTRIBUTES CONTROL CHARTS 

Control charts may be divided into either variables control charts or attributes control charts. The 
terms "variables" and "attributes" are associated with the type of data being collected on the process. 
When measuring characteristics such as time, weight, volume, length, pressure drop, concentration, etc., 
we consider such data to be continuous and refer to it as variables data. When counting the number of 
defective items in a sample or the number of defects associated with a particular type of item, the 
resulting data are called attributes data. Variables data are considered to be of a higher level than 
attributes data. Table 18.1 gives the names of many of the various variables and attributes control charts 
and the statistics plotted on the chart. 



Table 18.1 



Chart Type 


Statistics Plotted 


X-b&r and R chart 
X-b&r and Sigma chart 
Median chart 
Individuals chart 
Cusum chart 
Zone chart 
EWMA chart 


Averages and ranges of subgroups of variables data 

Averages and standard deviations of subgroups of variables data 

Median of subgroups of variables data 

Individual measurements 

Cumulative sum of each X minus the nominal 

Zone weights 

Exponentially weighted moving average 


P-chart 
JVP-chart 
C-chart 
[/-chart 


Ratio of defective items to total number inspected 
Actual number of defective items 
Number of defects per item for a constant sample size 
Number of defects per item for varying sample size 



The charts above the dashed line in Table 18.1 are variables control charts and the charts below the 
dashed line are attributes control charts. We shall discuss some of the more basic charts. MINITAB will 
be used to construct the charts. Today, charting is almost always accomplished by the use of statistical 
software such as MINITAB. 



AT-BAR AND R CHARTS 

The general idea of an .V-bar chart can be understood by considering a process having mean \i and 
standard deviation a. Suppose the process is monitored by taking periodic samples, called subgroups, of 
size n and computing the sample mean, X, for each sample. The central limit theorem assures us that the 
mean of the sample mean is n and the standard deviation of the sample mean is a/^/n. The centerline for 
the sample means is taken to be u and the upper and lower control limits are taken to be 2>{a/y/n) above 
and below the centerline. The lower control limit is given by equation (7): 



LCL = u - 3(o-/Vii) 



(1) 



4H2 
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The upper control limit is given by equation (2): 

UCL=n+3(<r/V5) (2) 

For a normally distributed process, a subgroup mean will fall between the limits, given in (7) and (2), 
99.7% of the time. In practice, the process mean and the process standard deviation are unknown and 
need to be estimated. The process mean is estimated by using the mean of the periodic sample means. 
This is given by equation (5), where m is the number of periodic samples of size n selected. 

The mean,X, can also be found by summing all the data and then dividing by mn. The process standard 
deviation is estimated by pooling the subgroup variances, averaging the subgroup standard deviations or 
ranges, or by sometimes using a historical value of a. 



EXAMPLE 1. Data are obtained on the width of a product. Five observations per time period are sampled for 
20 periods. The data are shown below in Table 18.2. The number of periodic samples is m = 20, the sample size 
or subgroup size is n=5, the sum of all the data is 199.84, and centerline is X = 1.998. The MINITAB pull- 
down menu "Stat => Control charts => .Ybar" was used to produce the control chart shown in Fig. 18-2. The 
data in Table 18.2 are stacked into a single column before applying the above pull-down menu sequence. 



2.007 
1.988 
2.002 
1.978 
2.012 



2.008 
2.007 



2.011 
2.005 



2.018 
2.009 



1.986 
2.010 



The standard deviation for the process may be estimated in four different ways: By using the average 
of the 20 subgroup ranges, by using the average of the 20 subgroup standard deviations, by pooling the 
20 subgroup variances, or by using a historical value for a, if one is known. MINITAB allows for all four 
options. The 20 means for the samples shown in Table 18.2 are plotted in Fig. 18-2. The chart indicates 
a process that is in control. The individual means randomly vary about the centerline and none fall 
outside the control limits. 
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The R chart is used to track process variation. The range, R, is computed for each of the m 
subgroups. The centerline for the R chart is given by equation (4). 

R^ (4) 
m 

As with the Z-bar chart, several different methods are used to estimate the standard deviation of the 
process. 



EXAMPLE 2. For the data in Table 18.2, the range of the first subgroup is R x = 2.000 - 1.975 = 0.025 and the 
range for the second subgroup is R 2 = 2.012 - 1.978 = 0.034. The 20 ranges are: 0.025, 0.034, 0.038, 0.031, 0.028, 
0.030, 0.054, 0.039, 0.026, 0.048, 0.026, 0.018, 0.012, 0.027, 0.030, 0.027, 0.049, 0.053, 0.047 and 0.020. The mean of 
these 20 ranges is 0.0327. A MINITAB plot of these ranges is shown in Fig. 18-3. The R chart does not indicate any 
unusual patterns with respect to variability. The MINITAB pull-down menu "Stat — ► Control charts — > R chart" is 
used to produce the control chart shown in Fig. 18-3. The data in Table 18.2 are stacked into a single column before 
applying the above pull-down menu sequence. 

0.07 

0.06 

0.05 
| 0.04 
1 0.03 
m 0.02 

0.01 

0.00 

2 4 6 8 10 12 14 16 18 20 
Sample 

Fig. 18-3 R chart for width. 
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TESTS FOR SPECIAL CAUSES 

In addition to a point falling outside the control limits of a control chart, there are other indications 
that are suggestive of non-randomness of a process caused by special effects. Table 18.3 gives eight tests 
for special causes. 

Table 18.3 Tests for Special Causes 

1 . One point more than 3 sigmas from centerline 

2. Nine points in a row on same side of centerline 

3. Six points in a row, all increasing or all decreasing 

4. Fourteen points in a row, alternating up and down 

5. Two out of three points more than 2 sigmas from centerline (same side) 

6. Four out of five points more than 1 sigma from centerline (same side) 

7. Fifteen points in a row within 1 sigma of centerline (either side) 

8. Eight points in a row more than 1 sigma from centerline (either side) 



PROCESS CAPABILITY 

To perform a capability analysis on a process, the process needs to be in statistical control. It is 
usually assumed that the process characteristic being measured is normally distributed. This may be 
checked out using tests for normality such as the Kolmogorov-Smirnov test, the Ryan-Joiner test, or the 
Anderson-Darling test. Process capability compares process performance with process requirements. 
Process requirements determine specification limits. LSL and USL represent the lower specification limit 
and the upper specification limit. 

The data used to determine whether a process is in statistical control may be used to do the 
capability analysis. The 3-sigma distance on either side of the mean is called the process spread. The 
mean and standard deviation for the process characteristic may be estimated from the data gathered for 
the statistical process control study. 



EXAMPLE 3. As we saw in Example 2, the data in Table 18.2 come from a process that is in statistical control. 
We found the estimate of the process mean to be 1.9984. The standard deviation of the 100 observations is found 
to equal 0.013931. Suppose the specification limits are LSL = 1.970 and USL = 2.030. The Kolmogorov-Smirnov 
test for normality is applied by using M1NITAB and it is found that we do not reject the normality of the 
process characteristic. The nonconformance rates are computed as follows. The proportion above the 
USL = P{X> 2.030) = P[{X- 1.9984)/0.013931 > (2.030- 1.9984) /0.013931] = P{Z > 2.27) = 0.01 16. That is, 
there are 0.0116(1,000,000)= 11,600 parts per million (ppm) above the USL that are nonconforming. Note that 
P(Z> 2.27 ) may be found using M1N1TAB rather than looking it up in the standard normal tables. This is done as 
follows. Use the pull-down mean Calc — » Probability Distribution — » Normal. 

X=2.21 gives the following 

x P(X^x) 
2.2700 0.9884 

We have P(Z<2.27) = 0.9884 and therefore P(Z > 2.27) = 1 - 0.9884 = 0.0116. 

Similarly, the proportion below the LSL= P{X< 1.970) = P{Z< -2.04) = 0.0207. There are 20,700 ppm below 
the LSL that are nonconforming. Again, MINITAB is used to find the area to the left of -2.04 under the standard 
normal curve. 

The total number of nonconforming units is 11,600 + 20,700 = 32,300 ppm. This is of course an unacceptably high 
number of nonconforming units. 
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Suppose p. represents the estimated mean for the process characteristic and a represents the 
estimated standard deviation for the process characteristic, then the nonconformance rates are 
estimated as follows: The proportion above the USL equals 

p(x>usl) = p(z>H^) 

and the proportion below the LSL equals 



The process capability index measures the process's potential for meeting specifications, and is 
defined as follows: 

^ _ allowable spread _ USL - LSL 
p measured spread 6d 



EXAMPLE 4. For the process data in Table 18.2, USL - LSL = 2.030 - 1.970 = 0.060, 6a = 6(0.013931) = 
0.083586, and C P = 0.060/0.083586 = 0.72. 



The Cp K index measures the process performance, and is defined as follows: 
C PK = m 



fUSL-Li u-LSLl 
{ M ' 3<r J 



EXAMPLE 5. For the process data in Example 



f2.030- 1.9984 1.9984- 1.9701 .. r _ 
C PK = — I 3(0 Q13931) , 3(0 . 013931) } = — um{0.76, 0.68} = 0.68 

For processes with only a lower specification limit, the lower capability index C PL is defined as 
follows: 

u-LSL 

For processes with only an upper specification limit, the upper capability index C PU is defined as 
follows: 

r USL ~ A m 

pu = 3& ' ^ 

Then C PK may be defined in terms of C PL and C PU as follows: 

C PK = min{C PL , C PU } (9) 
The relationship between nonconformance rates and C PL and C PU are obtained as follows: 

P(X < LSL) = p[z < LS V~ - P(Z < -3C PL ), since - 3C PL = LS \~ - 

USL - Q 

= P(Z > 3C PU ), since 3C PU = — - 
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EXAMPLE 6. Suppose that C PL = 1.1, then the proportion nonconforming is P(Z < -3(1.1)) = P(Z < -3.3). 
This may be found using MINITAB as follows. Give the pull-down "Calc=> Probability Distribution => Normal." 

The cumulative area to the left of -3.3 is given as: 



Cumulative Distribution Function 

Normal with mean = and standard deviation = 1 
x P(X^x) 
-3.3 0.00048348 

There would be 1,000,000 x 0.00048348 = 483 ppm nonconforming. Using this technique, a (able relating noncon- 
formance rate to capability index can be constructed. This is given in Table 18.4. 



Table 18.4 



C PL or C PU 


Proportion Nonconforming 


ppm 




0.38208867 


382089 


0.2 


0.27425308 


274253 


0.3 


0.18406010 


184060 


0.4 


0.11506974 


115070 


0.5 


0.06680723 


66807 


0.6 


0.03593027 


35930 


0.7 


0.01786436 


17864 


0.8 


0.00819753 


8198 


0.9 


0.00346702 


3467 


1.0 


0.00134997 


1350 


1.1 


0.00048348 


483 


1.2 


0.00015915 


159 


1.3 


0.00004812 


48 


1.4 


0.00001335 


13 


1.5 


0.00000340 


3 


1.6 


0.00000079 


1 


1.7 


0.00000017 





1.8 


0.00000003 





1.9 


0.00000001 





2.0 


0.00000000 






EXAMPLE 7. A capability analysis using MINITAB and the data in Table 18.2 may be obtained by using 
the following pull-down menus in MINITAB Stat => Quality tools => Capability Analysis (Normal). The 
MINITAB output is shown in Fig. 18-4. The output gives nonconformance rates, capability indexes, and 
several other measures. The quantities found in Examples 3, 4, and 5 are very close to the corresponding measures 
shown in the figure. The differences are due to round-off error as well as different methods of estimating certain 
parameters. The graph is very instructive. It shows the distribution of sample measurements as a histogram. 
The population distribution of process measurements is shown as the normal curve. The tail areas under the 
normal curve to the right of the USL and to the left of the LSL represent the percentage of nonconforming 
products. By multiplying the sum of these percentages times one million, we get the ppm non-conformance rate 
for the process. 
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Process capability of width 



Process Data 


LSL 


1 .97000 






USL 


2.03000 


Sample Mean 


1 .99842 


Sample N 


100 


StDev (Within) 


0.01406 


StDev (Overall) 


0.01397 




Potential (Within) Capability 
Cp 0.71 
CPL 0.67 
CPU 0.75 
Cpk 0.67 
CCpk 0.71 
Overall Capability 

Pp 0.72 

PPL 0.68 

PPU 0.75 

Ppk 0.68 

Cpm 



2.00 2.01 2.02 2.03 



Observed Performance 
PPM < LSL 20000.00 
PPM > USL 20000.00 
PPM Total 40000.00 



Exp. Within Performance 
PPM < LSL 21647.20 
PPM > USL 12366.21 
PPM Total 34013.41 



Exp. Overall Performance 
PPM < LSL 20933.07 
PPM > USL 11876.48 
PPM Total 32809.56 



Fig. 18-4 Several capabilih 



are given for the process. 



P- AND NP-CHARTS 

When mass-produced products are categorized or classified, the resulting data are called attributes 
data. After establishing standards that a product must satisfy, specifications are determined. An item not 
meeting specifications is called a non-conforming item. A nonconforming item that is not usable is called a 
defective item. A defective item is considered to be more serious than a nonconforming item. An item 
might be nonconforming because of a scratch or a discoloration, but not be a defective item. The failure 
of a performance test would likely cause the product to be classified as defective as well as nonconform- 
ing. Flaws found on a single item are called nonconformities. Nonrepairable flaws are called defects. 

Four different control charts are used when dealing with attributes data. The four charts are the 
P-, NP-, C-, and [/-chart. The P- and the TAP-charts are based on the binomial distribution and the 
C- and [/-charts are based on the Poisson distribution. The P-chart is used to monitor the proportion 
of nonconforming items being produced by a process. The P-chart and the notation used to describe it 
are illustrated in Example 8. 



EXAMPLE 8. Suppose 20 respirator masks are examined every thirty minutes and the number of defective units 
are recorded per 8-hour shift. The total number examined on a given shift is equal to n = 20(16) = 320. Table 18.5 
gives the results for 30 such shifts. The center line for the />-chart is equal to the proportion of defectives for the 
30 shifts, and is given by total number of defectives divided by the total number examined for the 30 shifts, or 

p = 72/9600 = 0.0075. 

The standard deviation associated with the binomial distribution, which underlies this chart, is 



The 3-sigma control limits for this process are 
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Table 18.5 





Number 


Proportion 




Number 


Proportion 




Defective 


Defective 




Defective 


Defective 


Shift # 


X t 


Pi = X/n 


Shift # 


x t 


P, =X/n 


1 


1 


0.003125 


16 


2 


0.006250 


2 


2 


0.006250 


17 





0.000000 


3 


2 


0.006250 


18 


4 


0.012500 


4 





0.000000 


19 


1 


0.003125 


5 


4 


0.012500 


20 


7 


0.021875 


6 


4 


0.012500 


21 


4 


0.012500 


7 


4 


0.012500 


22 


1 


0.003125 


8 


6 


0.018750 


23 





0.000000 


9 


4 


0.012500 


24 


4 


0.012500 


10 





0.000000 


25 


4 


0.012500 


11 





0.000000 


26 


3 


0.009375 


12 





0.000000 


27 




0.006250 


13 


1 


0.003125 


28 





0.000000 


14 





0.000000 


29 





0.000000 


15 


7 


0.021875 


30 


5 


0.015625 




The lower control limit is LCL = 0.0075 - 3(0.004823) = -0.006969. When the LCL is negative, it is taken to 
be zero since the proportion defective in a sample can never be negative. The upper control limit is UCL = 0.0075 + 
3 (0.004823) = 0.021969. 

The MINITAB solution to obtaining the P-chart for this process is given by using the pull-down menus 
Stat -> Control charts -> P. The P-chart is shown in Fig. 18-5. Even though it appears that samples 15 and 



0.025 




20 indicate the presence of a special cause, when the proportion defective for samples 15 and 20 ( both equal 
to 0.021875 ) are compared with the UCL = 0.021969, it is seen that the points are not beyond the UCL. 
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The NP-chart monitors the number of defectives rather than the proportion of defectives. The 
/VP-chart is considered by many to be preferable to the P-chart because the number defective is easier 
for quality technicians and operators to understand than is the proportion defective. The centerline for 
the iVT-chart is given by np and the 3-sigma control limits are 

np±3y/np(l-p) (11) 

EXAMPLE 9. For the data in Table 18.5, the centerline is given by np = 320(.0075) = 2.4 and the control limits are 
LCL = 2.4 - 4.63 = -2.23, which we take as 0, and UCL = 2.4 + 4.63 = 7.03. If 8 or more defectives are found 
on a given shift, the process is out of control. The MINITAB solution is found by using the pull-down sequence 



Stat -> Control charts -> NP 




3 6 9 12 15 18 21 24 27 30 
Sample 



Fig. 18-6 The NP-chart monitors the number defective. 



The number defective per sample needs to be entered in some column of the worksheet prior to executing 
the pull-down sequence. The MINITAB TVP-chart is shown in Fig. 18-6. 



OTHER CONTROL CHARTS 

This chapter serves as only an introduction to the use of control charts to assist in statistical process 
control. Table 18.1 gives a listing of many of the various control charts in use in industrial settings today. 
To expedite calculations on the shop floor, the median chart is sometimes used. The medians of the 
samples are plotted rather than the means of the samples. If the sample size is odd, then the median is 
simply the middle value in the ordered sample values. 

For low-volume production runs, individuals charts are often used. In this case, the subgroup or 
sample consists of a single observation. Individuals charts are sometimes referred to as X charts. 

A zone chart is divided into four zones. Zone 1 is defined as values within 1 standard deviation 
of the mean, zone 2 is defined as values between 1 and 2 standard deviations of the mean, zone 3 is 
defined as values between 2 and 3 standard deviations of the mean, and zone 4 as values 3 or more 
standard deviations from the mean. Weights are assigned to the four zones. Weights for points on 
the same side of the centerline are added. When a cumulative sum is equal to or greater than the 
weight assigned to zone 4, this is taken as a signal that the process is out of control. The cumulative 
sum is set equal to after signaling a process out of control, or when the next plotted point crosses 
the centerline. 
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The exponentially weighted moving average (EWMA chart) is an alternative to the individuals or 
X-bar chart that provides a quicker response to a shift in the process average. The EWMA chart 
incorporates information from all previous subgroups, not only the current subgroup. 

Cumulative sums of deviations from a process target value are utilized by a Cusum chart. Both the 
EWMA chart and the Cusum chart allow for quick detection of process shifts. 

When we are concerned with the number of nonconformities or defects in a product rather than 
simply determining whether the product is defective or non defective, we use a C-chart or a U-chart. 
When using these charts, it is important to define an inspection unit. The inspection unit is defined as 
the fixed unit of output to be sampled and examined for nonconformities. When there is only one 
inspection unit per sample, the C-chart is used, and when the number of inspection units per sample 
vary, the {/-chart is used. 



Solved Problems 



JT-BAR AND R CHARTS 



18.1 An industrial process fills containers with breakfast oats. The mean fill for the process is 510 
grams (g) and the standard deviation of fills is known to equal 5 g. Four containers are selected 
every hour and the mean weight of the subgroup of four weights is used to monitor the process for 
special causes and to help keep the process in statistical control. Find the lower and upper control 
limits for the Z-bar control chart. 
SOLUTION 

In this problem, we are assuming that |i and a are known and equal 510 and 5, respectively. 
When u. and a are unknown, they must be estimated. The lower control limit is LCL = /i - 3(cr/^/w) = 
510 - 3(2.5) = 502.5 and the upper control limit is UCL = fj, + 3(cr/v^) = 510 + 3(2.5) = 517.5 



Period Period 



1.987 
1.983 



2.019 1.976 
2.021 2.007 



1.991 
1.972 



Period Period 



2.004 
1.980 



1.994 
2.006 



2.008 
2.007 



1.999 
1.984 



2.011 
2.005 



2.018 
2.009 
2.023 
2.010 
1.993 



2.025 
2.022 
2.035 
2.013 
2.020 



2.002 
1.969 
2.018 
1.984 
1.990 
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Table 18.6 contains the widths of a product taken at 20 time periods. The control limits 
for an Z-bar chart are LCL= 1.981 and UCL = 2.018. Are there any of the subgroup means 
outside the control limits? 

SOLUTION 

The means for the 20 subgroups are 1.9896, 1.9974, 2.0032, 1.9916, 2.0014, 1.9890, 1.9942, 1.9952, 
2.0058, 2.0066, 1.9964, 1.9928, 2.0024, 1.9974, 2.0106, 2.0230, 1.9926, 1.9948, 2.0012, and 2.0044, 
respectively. The sixteenth mean, 2.0230, is outside the upper control limit. All others are within the control 



18.3 Refer to Problem 18.2. It was determined that a spill occurred on the shop floor just before 
the sixteenth subgroup was selected. This subgroup was eliminated and the control limits were 
re-computed and found to be LCL= 1.979 and UCL = 2.017. Are there any of the means other 
than the mean for subgroup 16 outside the new limits. 

SOLUTION 

None of the means given in Problem 18.2 other than the sixteenth one fall outside the new limits. 
Assuming that the new chart does not fail any of the other tests for special causes given in table 18.3, the 
control limits given in this problem could be used to monitor the process. 



Verify the control limits given in Problem 18.2. Estimate the standard deviation of the process by 
pooling the 20 sample variances. 

SOLUTION 

The mean of the 100 sample observations is 1.999. One way to find the pooled variance for the 
20 samples is to treat the 20 samples, each consisting of 5 observations, as a one-way classification. 
The within or error mean square is equal to the pooled variance of the 20 samples. The M1NITAB analysis 
as a one-way design gave the following analysis of variance table. 



Factor 19 0.006342 0.000334 1.75 0.044 
Error 80 0.015245 0.000191 
Total 99 0.021587 

The estimate of the standard deviation is V0. 000191 = 0.01382. The lower control limit is LCL = 
1.999 - 3(0.01382/^/5) = 1.981 and the upper control limit is UCL = 1.999 + 3(0.01382/ v / 5) = 2.018. 



TESTS FOR SPECIAL CAUSES 

18.5 Table 18.7 contains data from 20 subgroups, each of size 5. The Z-bar chart is given in Fig. 18-7. 
What effect did a change to a new supplier at time period 10 have on the process? Which test for 
special causes in Table 18.3 did the process fail? 

SOLUTION 

The control chart in Fig. 18-7 shows that the change to the new supplier caused an increase in the width. 
This shift after time period 10 is apparent. The 6 shown on the graph in Fig. 18-7 indicates test 6 given in 
Table 18.3 was failed. Four out of five points were more than 1 sigma from the centerline (same side). 
The five points correspond to subgroups 4 through 8. 
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Period Period 



1 .989 
1.989 
1.997 
1 .976 
2.007 



Period Period 



1 .4X3 
1 .972 



1.991 
1 .997 



1 .966 
1 .982 
1.995 
2.020 
2.008 



Period Period 



Period Period 



2.014 
1.990 
2.008 
2.004 
2.016 



2.001 
2.013 
2.007 



2.006 
2.015 
2.006 
2.018 
2.017 



2.021 
2.015 



2.028 
2.019 
2.033 
2.020 
2.003 



2.020 
2.022 
2.023 



PROCESS CAPABILITY 

18.6 Refer to Problem 18.2. After determining that a special cause was associated with subgroup 16, 
we eliminate this subgroup. The mean width is estimated by finding the mean of the data from the 
other 19 subgroups and the standard deviation is estimated by finding the standard deviation 
of the same data. If the specification limits are LSL= 1.960 and USL = 2.040, find the lower 
capability index, the upper capability index, and the C PK index. 
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SOLUTION 

Using the 95 measurements after excluding subgroup 16, we find that £ = 1.9982 and a = 0.01400. 
The lower capability index is 

_ A-LSL _ 1.9982- 1.960 _ 
0.0420 



C PL =^ 



- = 0.910 



the upper capability index is 

Cpu 

and C PK = min{C PL , C PU } = 0.91. 



USL 2.040- 1.9982 _ 

3a ~ 0.042 



18.7 Refer to Problem 18.1. (a) Find the percentage nonconforming if LSL = 495 and USL = 525. 
(b) Find the percentage nonconforming if LSL = 490 and USL = 530. 

SOLUTION 

(a) Assuming the fills are normally distributed, the area under the normal curve below the LSL is found by 
using the EXCEL command =NORMDlST (495 , 510 , 5 , 1) which gives 0.001350. By symmetry, the 
area under the normal curve above the USL is also 0.001350. The total area outside the specification 
limits is 0.002700. The ppm nonconforming is 0.002700(1,000,000) = 2700. 

(b) The area under the normal curve for LSL = 490 and USL = 530 is found similarly to be 
0.000032 + 0.000032 = 0.000064. The ppm is found to be 0.000064(1,000,000) = 64. 

P- AND A^-CHARTS 

18.8 Printed circuit boards are inspected for defective soldering. Five hundred circuit boards per day 
are tested for a 30-day period. The number defective per day are shown in Table 18.8. Construct 
a P-chart and locate any special causes. 



Day 


1 


2 


3 


4 


5 


6 


7 


8 


9 


10 


# Defective 


2 







5 




4 


5 


1 




3 


Day 


11 


12 


13 


14 


15 


16 


17 


18 


19 


20 


# Defective 


3 


2 





4 


3 


8 


10 


4 


4 


5 


Day 


21 


22 


23 


24 


25 


26 


27 


28 


29 


30 


# Defective 


2 


4 


3 


2 


3 


3 


2 


1 


1 


2 



SOLUTION 

The confidence limits are 

The centerline is p = 92/15,000 = 0.00613 and the standard deviation is 

= 0.00349 



/(0.00613)(0.99387) _ 



500 
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The lower control limit is 0.00613 - 0.01047 = -0.00434, and is taken to equal 0, since proportions 
cannot be negative. The upper control limit is 0.00613 + 0.01047 = 0.0166. The proportion of defectives on 
day 17 is equal to P 17 = 10/500 = 0.02 and is the only daily proportion to exceed the upper limit. 

18.9 Give the control limits for an AP-chart for the data in Problem 18.8. 
SOLUTION 

The control limits for the number defective are np = 3y/np(l - p). The centerline is np = 3.067. 
The lower limit is and the upper limit is 8.304. 

18.10 Suppose respirator masks are packaged in either boxes of 25 or 50 per box. At each 30 minute 
interval during a shift a box is randomly chosen and the number of defectives in the box deter- 
mined. The box may either contain 25 or 50 masks. The number checked per shift will vary 
between 400 and 800. The data are shown in Table 18.9. Use MINITAB to find the control 
chart for the proportion defective. 



Table 18.9 





Sample 


Number 


Proportion 




Size 


Defective 


Defective 


Shift # 




%i 


Pi = Xihi 


1 


400 


3 


0.0075 


2 


575 


7 


0.0122 


3 


400 


1 


0.0025 


4 


800 


7 


0.0088 


5 


475 


2 


0.0042 


6 


575 





0.0000 


7 


400 


8 


0.0200 


8 


625 


1 


0.0016 


9 


775 


10 


0.0129 


10 


425 


8 


0.0188 


11 


400 


7 


0.0175 


12 


400 




0.0075 


13 


625 


6 


0.0096 


14 


800 


5 


0.0063 


15 


800 


4 


0.0050 


16 


800 


7 


0.0088 


17 


475 


9 


0.0189 


18 


800 


9 


0.0113 


19 


750 


9 


0.0120 


20 


475 


2 


0.0042 



SOLUTION 

When the sample sizes vary in monitoring a process for defectives, the centerline remains the same, that 
is, it is the proportion of defectives over all samples. The standard deviation, however, changes from sample 
to sample and gives control limits consisting of stair-stepped control limits. The control limits are 
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The centerline is p = 108/11,775 = 0.009172. For the first subgroup, we have «, = 400 
lp(\ -p) /(0.009172)(0.990828) 

V^~~ = V 40 



'- = 0.004767 



and 3(0.004767) = 0.014301. The lower limit for subgroup 1 is and the upper limit is 0.009172 + 
0.014301 = 0.023473. The limits for the remaining shifts are determined similarly. These changing limits 
give rise to the stair-stepped upper control limits shown in Fig. 18-8. 



0.025 
0.020 
0.015 
0.010 
0.005 
0.000 




UCL = 0.02229 



2 4 6 8 10 12 14 16 18 20 
Sample 

Tests performed with unequal sample sizes 

Fig. 18-8 P-chart with unequal sample sizes 



OTHER CONTROL CHARTS 



18.11 When measurements are expensive, data are available at a slow rate, or when output at any 
point is fairly homogeneous, an individuals chart with moving range may be indicated. The 
data consist of single measurements taken at different points in time. The centerline is 
the mean of all the individual measurements, and variation is estimated by using moving 
ranges. Traditionally, moving ranges have been calculated by subtracting adjacent data 
values and taking the absolute value of the result. Table 18.10 gives the coded breaking 
strength measurements of an expensive cable used in aircraft. One cable per day is selected 
from the production process and tested. Give the MINITAB-generated individuals chart and 
interpret the output. 



Table 18.10 



Day 


1 


2 


3 


4 


5 


6 


7 


8 


9 


10 


Strength 


491.5 


502.0 


505.5 


499.6 


504.1 


501.3 


503.5 


504.3 


498.5 


508.8 


Day 


11 


12 


13 


14 


15 


16 


17 


18 


19 


20 


Strength 


515.4 


508.0 


506.0 


510.9 


507.6 


519.1 


506.9 


510.9 


503.9 


507.4 
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SOLUTION 

The following pull-down menus are used: Stats — > Control charts — » individuals. 



525 



520 

o) 515 
> 510 
| 505 
~ 500 

495 

490 

2 4 6 8 10 12 14 16 18 20 
Observation 

Fig. 18-9 Individuals chart for strength. 



Figure 18-9 shows the individuals chart for the data in Table 18.10. The individual values in 
Table 18.10 are plotted on the control chart. The 2 that is shown on the control chart for weeks 9 and 
18 corresponds to the second test for special causes given in Table 18.3. This indication of a special 
cause corresponds to nine points in a row on the same side of the centerline. An increase in the process 
temperature at time period 10 resulted in an increase in breaking strength. This change in breaking 
strength resulted in points below the centerline prior to period 10 and mostly above the centerline after 
period 10. 



18.12 The exponentially weighted moving average, EWMA chart is used to detect small 
shifts from a target value, t. The points on the EWMA chart are given by the following 
equation: 

x i = wx i + (l-w)x i _ 1 

To illustrate the use of this equation, suppose the data in Table 18.7 were selected from a 
process that has target value equal to 2.000. The stating value x is chosen to equal the target 
value, 2.000. The weight w is usually chosen to be between 0.10 and 0.30. MINITAB 
uses the value 0.20 as a default. The first point on the EWMA chart would be 
x 1 =wx 1 +(l-w)x = 0.20(1.9896) + 0.80(2.000) = 1.9979. The second point on the chart 
would be x 2 = wx 2 +{\ -w)x { = 0.20(1.9974) +0.80(1.9979) = 1.9978, and so forth. The 
MINITAB analysis is obtained by using the following pull-down menu Stat — » Control charts 
-> EWMA. The target value is supplied to MINITAB. The output is shown in Fig. 18-10. 
By referring to Fig. 18-10, determine for which subgroups the process shifted from the target 
value. 

SOLUTION 

The graph of the x,- values crosses the upper control limit at the time point 15. This is the point at which 
we would conclude that the process had shifted away from the target value. Note that the EWMA chart has 
stair-stepped control limits. 
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2.0100 
2.0075 
2.0050 
< 2.0025 
m 2.0000 
1 .9975 
1 .9950 



2 4 6 8 10 12 14 16 18 20 
Sample 

Fig. 18-10 Exponentially Weighted Moving Average chart. 

18.13 A zone chart is divided into four zones. Zone 1 is defined as values within 1 standard deviation of 
the mean, zone 2 is defined as values between 1 and 2 standard deviations of the mean, zone 3 is 
defined as values between 2 and 3 standard deviations of the mean, and zone 4 as values 3 or more 
standard deviations from the mean. The default weights assigned to the zones by MINITAB are 
0, 2, 4, and 8 for zones 1 through 4. Weights for points on the same side of the centerline are 
added. When a cumulative sum is equal to or greater than the weight assigned to zone 4, this 
is taken as a signal that the process is out of control. The cumulative sum is set equal to 
after signaling a process out of control, or when the next plotted point crosses the centerline. 
Figure 18-11 shows the MINITAB analysis using a zone chart for the data in Table 18.6. 
The pull-down menus needed to produce this chart are Stat — ► Control charts — ► Zone. What 
out of control points does the zone chart find? 

SOLUTION 

Subgroup 16 corresponds to an out of control point. The zone score corresponding to subgroup 16 is 10, 
and since this exceeds the score assigned to zone 4, this locates an out of control time period in the process. 



+3 StDev = 2.01806 
+2StDev = 2.01187 
+1 StDev = 2.00567 

X= 1.99948 

-1 StDev = 1 .99329 

-2 StDev = 1 .98709 

-3 StDev = 1.98090 



2 4 6 8 10 12 14 16 18 20 
Sample 

Fig. 18-11 Zone chart for width. 
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18.14 When we are concerned with the number of nonconformities or defects in a product rather than 
simply determining whether the product is defective or nondefective, we use a C-chart or a 
U-chart. When using these charts, it is important to define an inspection unit. The inspection 
unit is defined as the fixed unit of output to be sampled and examined for nonconformities. When 
there is only one inspection unit per sample, the C-chart is used; when the number of inspection 
units per sample varies, the {/-chart is used. 

One area of application for C- and [/-charts is in the manufacture of roll products such as 
paper, films, plastics, textiles, and so forth. Nonconformities or defects, such as the occurrence of 
black spots in photographic film, as well as the occurrence of fiber bundles, dirt spots, pinholes, 
static electricity marks, agglomerates in various other roll products, always occur at some level in 
the manufacture of roll products. The purpose of the C- or [/-chart is to make sure that the 
process output remains within an acceptable level of occurrence of such nonconformities. 
These nonconformities often occur randomly and independently of one another over the total 
area of the roll product. In such cases, the Poisson distribution is used to form the control chart. 
The centerline for the C-chart is located at c, the mean number of nonconformities over all 
subgroups. The standard deviation of the Poisson distribution is and therefore the 3 sigma 
control limits are c ± 3\^. That is, the lower limit is LCL — c- 3^ and the upper limit is 
UCL = c+3^. 

When a coating is applied to a material, small nonconformities called agglomerates 
sometimes occur. The number of agglomerates in a length of 5 feet (ft) are recorded for a 
jumbo roll of product. The results for 24 such rolls are given in Table 18.11. Are there any points 
outside the 3-sigma control limits? 



Table 18.11 



Jumbo roll # 


1 


2 


3 


4 


5 


6 


7 


8 


9 


10 


11 


12 


Agglomerates 


3 


3 


6 





7 


5 




6 


3 


5 


2 




Jumbo roll # 


13 


14 


15 


16 


17 


18 


19 


20 


21 


22 


23 


24 


Agglomerates 


2 


7 


6 


4 


7 


8 


5 


13 


7 


3 


3 


7 



SOLUTION 

The mean number of agglomerates per jumbo roll is equal to the total number of agglomerates divided 
by 24 or c = 1 17/24 = 4.875. The standard deviation is ^ = 2.208. The lower control limit is 
LCL = 4.875 - 3(2.208) = -1.749. Since it is negative, we take the lower limit to be 0. The upper limit is 
UCL = 4.875 + 3(2.208) or 11.499. An out of control condition is indicated for jumbo roll # 20 since the 
number of agglomerates, 13, exceeds the upper control limit, 11.499. 



18.15 This problem is a follow up to Problem 18.14. You should review 18.14 before attempting this 
problem. Table 18.12 gives the data for 20 Jumbo rolls. The table gives the roll number, the length 
of roll inspected for agglomerates, the number of inspection units (recall from Problem 18.14 that 
5 ft constitutes an inspection unit), the number of agglomerates found in the length inspected, and 
the number of agglomerates per inspection unit. The centerline for the [/-chart is u, the sum of 
column 4 divided by the sum of column 3. The standard deviation, however, changes from sample 
to sample and gives control limits consisting of stair-stepped control limits. The lower 
control limit for sample is LCL = u — 3^u/nj and the upper control limit for sample i is 
UCL = u+3y/u/nt. 



CHAP. 18] 



STATISTICAL PROCESS CONTROL AND PROCESS CAPABILITY 



499 



Table 18.12 



Jumbo roll # 


Length 
Inspected 


# of Inspection 
Units, n, 


#of 
Agglomerates 


u t = 
Col. 4/Col. 3 


1 


5.0 


1.0 


6 


6.00 


2 


5.0 


1.0 


4 


4.00 


3 


5.0 


1.0 


6 


6.00 


4 


5.0 


1.0 


2 


2.00 


5 


5.0 


1.0 


3 


3.00 


6 


10.0 


2.0 


8 


4.00 


7 


7.5 


1.5 


6 


4.00 


8 


15.0 


3.0 


6 


2.00 


9 


10.0 


2.0 


10 


5.00 


10 


7.5 


1.5 


6 


4.00 


11 


5.0 


1.0 


4 


4.00 


12 


5.0 


1.0 


7 


7.00 


13 


5.0 


1.0 


5 


5.00 


14 


15.0 


3.0 


8 


2.67 


15 


5.0 


1.0 


3 


3.00 


16 


5.0 


1.0 


5 


5.00 


17 


15.0 


3.0 


10 


3.33 


18 


5.0 


1.0 


1 


1.00 


19 


15.0 


3.0 




2.67 


20 


15.0 


3.0 


15 


5.00 



Use MINITAB to construct the control chart for this problem and determine if the process is 
in control. 

SOLUTION 

The centerline for the [/-chart is u, the sum of column 4 divided by the sum of column 3. The standard 
deviation, however, changes from sample to sample and gives control limits consisting of stair-stepped 
control limits. The lower control limit for sample i is LCL = u— I^Jujuj and the upper control limit 



10 




2 4 6 8 10 12 14 16 18 20 
Sample 

Tests performed with unequal sample sizes 

Fig. 18-12 [/-chart for agglomerates. 
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for sample i is UCL = u + 'isjufn i . The centerline for the above data is u = 123/33 = 3.73. The MINITAB 
solution is obtained by the pull-down sequence Stat — ► Control Charts — »U. 

The information, required by MINITAB to create the [/-chart, is that given in columns 3 and 4 of 
Table 18.12. The [/-chart for the data in Table 18.12 is shown in Fig. 18-12. The control chart does not 
indicate any out of control points. 



Supplementary Problems 

X-BAR AND R CHARTS 

18.16 The data from ten subgroups, each of size 4, is shown in Table 18.13. Compute X and R for each subgroup 
as well as, X, and R. Plot the lvalues on a graph along with the centerline corresponding to X. On 
another graph, plot the ^-values along with the centerline corresponding to R. 



Table 18.13 



Subgroup 


Subgroup Observations 


1 


13 


11 


13 


16 


2 


11 


12 


20 


15 


3 


16 


18 


20 


15 


4 


13 


15 


18 


12 


5 


12 


19 


11 


12 


6 


14 


10 


19 


16 


7 


12 


13 


20 


10 


8 


17 


17 


12 


14 


9 


15 


12 


16 


17 


10 


20 


13 


18 


17 



18.17 A frozen food company packages 1 pound (lb) packages (454 g) of green beans. Every two hours, 4 of the 
packages are selected and the weight is determined to the nearest tenth of a gram. Table 18.14 gives the data 
for a one-week period. 



Mon. 
10:00 



12:00 



2:0(1 



4:00 



10:00 



12:00 



4:0(1 



453.0 451.6 

454.5 455.0 

452.6 452.8 
451.8 453.5 



452.0 455.4 

451.5 453.0 

450.8 454.3 

454.8 450.6 



450.9 
455.0 
453.6 



452.6 
452.8 
455.5 



453.2 453.0 

455.8 451.4 

452.0 452.5 

453.5 452.1 



4:00 



Thur. 
10:00 



12:00 



4:00 



10:00 



12:00 



454.7 
451.4 
450.9 



451.1 452.2 454.0 

452.6 448.9 452.8 

448.5 455.3 455.5 

454.4 453.9 453.8 



455.7 
451.8 
451.2 



455.3 454.2 451.1 

452.4 452.9 453.8 
452.3 451.5 452.4 
452.3 455.8 454.3 
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Use the method discussed in Problem 18.4 to estimate the standard deviation by pooling the variances 
of the 20 samples. Use this estimate to find the control limits for an X-bar chart. Are any of the 20 subgroup 
means outside the control limits? 



18.18 The control limits for the R chart for the data in Table 18.14 are LCL = and UCL = 8.205. Are any of the 
subgroup ranges outside the 3 sigma limits? 



18.19 The process that fills the 1 lb packages of green beans discussed in Problem 18.17 is modified in hopes of 
reducing the variability in the weights of the packages. After the modification was implemented and in use 
for a short time, a new set of weekly data was collected and the ranges of the new subgroups were plotted 
using the control limits given in Problem 18.18. The new data are given in Table 18.15. Does it appear that 
the variability has been reduced? If the variability has been reduced, find new control limits for the X-bar 
chart using the data in Table 18.15. 



Mon. 
10:00 



Mon. 
12:00 



Mon. 
2:00 



4:00 



10:00 



12:00 



Tue. 
2:00 



4:00 



454.9 
452.7 
457.0 
454.2 



454.2 
453.6 
454.4 
453.9 



454.4 
453.6 
453.6 
454.3 



454.7 
453.9 
454.6 
453.9 



454.3 454.2 
454.2 452.8 
454.2 453.3 

453.4 453.3 



454.6 
454.5 
454.3 



453.6 
453.2 
453.6 
453.1 



Wed. 

2:00 



Wed. 
4:00 



Thur. 
12:00 



Thur. 
2:00 



Thur. 
4:00 



2:00 



453.0 453.9 

454.0 454.2 

452.9 454.3 

454.2 454.7 



453.8 
453.6 



453.3 
454.7 
453.9 



454.2 454.4 

453.0 452.6 

453.8 454.9 

453.9 454.2 



455.7 
452.8 
453.8 
454.9 



TESTS FOR SPECIAL CAUSES 



18.20 Operators making adjustments to machinery too frequently is a problem in industrial processes. Table 18.16 
contains a set of data (20 samples each of size 5) in which this is the case. Find the control limits for an X-bar 
chart and then form the X-bar chart and check the 8 tests for special causes given in Table 18.3. 



10 



1.996 2.012 1.991 
1.972 2.025 1.970 
2.006 2.027 2.001 



2.003 
2.024 
2.005 



2.001 
2.026 
2.014 



1.998 2.015 1.98: 

1.992 2.000 1.98: 

2.005 2.026 1.99. 

1.985 2.006 2.0H 

1.966 2.012 2.03 
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2.004 
2.000 
2.012 



1.991 
1.979 



2.002 
2.011 
2.002 
2.014 
2.013 



2.005 
1.999 



2.016 
1.999 



2.006 
2.007 



PROCESS CAPABILITY 

18.21 Suppose the specification limits for the frozen food packages in Problem 18.17 are LSL = 450g and 
USL = 458g. Use the estimates of u and a obtained in Problem 18.17 to find C PK . Also estimate the ppm 
not meeting the specifications. 

18.22 In Problem 18.21, compute C PK and estimate the ppm nonconforming after the modifications made in 
Problem 18.19 have been made. 



P- AND A^-CHARTS 

18.23 A Company produces fuses for automobile electrical systems. Five hundred of the fuses are tested 
per day for 30 days. Table 18.17 gives the number of defective fuses found per day for the 30 days. 
Determine the centerline and the upper and lower control limits for a P-chart. Does the process 
appear to be in statistical control? If the process is in statistical control, give a point estimate for the ppm 
defective rate. 



Table 18.17 



Day 


1 




3 


4 


5 


6 


7 


8 


9 


10 


# Defective 


3 


3 


3 


3 


1 


1 


1 


1 


6 


1 


Day 


11 


12 


13 


14 


15 


16 


17 


18 


19 


20 


# Defective 


1 


1 


5 


4 


6 


3 


6 


2 


7 




Day 


21 


22 


23 


24 


25 


26 


27 


28 


29 


30 


# Defective 


2 


3 


6 


1 


2 


3 


1 


4 


4 


5 



18.24 Suppose in Problem 18.23, the fuse manufacturer decided to use an JVP-chart rather than a P-chart. Find the 
centerline and the upper and lower control limits for the chart. 



18.25 Scottie Long, the manager of the meat department of a large grocery chain store, is interested 
in the percentage of packages of hamburger meat that have a slight discoloration. Varying 
numbers of packages are inspected on a daily basis and the number with a slight discoloration is 
recorded. The data are shown in Table 18.18. Give the stair-stepped upper control limits for the 
20 subgroups. 
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Table 18.18 



Day 


Subgroup Size 


Number Discolored 


Percent Discolored 


! 


100 


1 


1.00 


2 


150 


1 


0.67 


3 


100 





0.00 


4 


200 


1 


0.50 


5 


200 


1 


0.50 


6 


150 





0.00 


7 


100 





0.00 


8 


100 





0.00 


9 


150 





0.00 


10 


200 




1.00 


11 


100 


1 


1.00 


12 


200 


1 


0.50 


13 


150 


3 


2.00 


14 


200 


2 


1.00 


15 


150 


1 


0.67 


16 


200 


1 


0.50 


17 


150 


4 


2.67 


18 


150 





0.00 


19 


150 





0.00 


20 


150 


2 


1.33 



OTHER CONTROL CHARTS 

18.26 Review Problem 18.11 prior to working this problem. Hourly readings of the temperature of an oven, used 
for bread making, are obtained for 24 hours. The baking temperature is critical to the process and the oven is 
operated constantly during each shift. The data are shown in Table 18.19. An individual chart is used to help 
monitor the temperature of the process. Find the centerline and the moving ranges corresponding to using 
adjacent pairs of measurements. How are the control limits found? 



Table 18.19 



Hour 








4 
















12 


Temperature 


350.0 


350.0 


349.8 


350.4 


349.6 


350.0 


349.7 


349.8 


349.4 


349.8 


350.7 


350.9 


Hour 


13 


14 


15 


16 


17 


18 


19 


20 


21 


22 


23 


24 


Temperature 


349.8 


350.3 


348.8 


351.6 


350.0 


349.7 


349.8 


348.6 


350.5 


350.3 


349.1 


350.0 



18.27 Review Problem 18.12 prior to working this problem. Use MINITAB to construct a EWMA chart 
for the data in Table 18.14. Using a target value of 454 g, what does the chart indicate concerning the 
process? 

18.28 Review the discussion of a zone chart in Problem 18.13 before working this problem. Construct a zone chart 
for the data in Table 18.16. Does the zone chart indicate any out of control conditions? What shortcoming of 
the zone chart does this problem show? 
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18.29 Work Problem 18.15 prior to working this problem. Construct the stair-stepped control limits for the u chart 
in Problem 18.15. 

18.30 A Pareto chart is often used in quality control. A Pareto chart is a bar graph that lists the defects that are 
observed in descending order. The most frequently occurring defects are listed first, followed by those that 
occur less frequently. By the use of such charts, areas of concern can be identified and efforts made to correct 
those defects that account for the largest percent of defects. The following defects are noted for respirator 
masks inspected during a given time period: Discoloration, loose strap, dents, tears, and pinholes. 
The results are shown in Table 18.20. 



Table 18.20 



discoloration 


discoloration 


discoloration 








discoloration 






discoloration 


strap 


discoloration 




discoloration 


discoloration 


discoloration 


discoloration 


dent 


discoloration 


dent 






pinhole 


discoloration 


dent 


discoloration 


pinhole 


discoloration 







Figure 18-13 is a Pareto chart generated by MINITAB. The data given in Table 18.20 are entered into 
column 1 of the worksheet. The pull-down menus needed to construct this chart are as follows: "Stat — > 
Quality tools — > Pareto charts." By referring to the Pareto chart, what type defect should receive the most 
attention? What two types of defects should receive the most attention? 



Pareto Chart of defect 




10 



defect 


discoloration 


strap 


dent 


tear 


pinhole 


Count 


14 


6 


4 


4 


2 


Percent 


46.7 


20.0 


13.3 


13.3 


6.7 




46.7 


66.7 


80.0 


93.3 


100.0 



Fig. 18-13 Respirator defects. 



Answers to 
Supplementary 
Problems 



CHAPTER 1 

1.46 (a) Continuous; (b) continuous; (c) discrete; (d) discrete; (e) discrete. 

1.47 (a) Zero upward; continuous, (b) 2, 3, ... ; discrete. 

(c) Single, married, divorced, separated, widowed; discrete, (d) Zero upward; continuous, 
(e) 0, 1, 2,...; discrete. 

1.48 (a) 3300; (b) 5.8; (c) 0.004; (d) 46.74; (e) 126.00; (/) 4,000,000; (g) 148; (h) 0.000099; (0 2180; (,/) 43.88. 

1.49 (a) 1,325,000; (/;) 0.0041872; (c) 0.0000280; (d) 7,300,000,000; (e) 0.0003487; (/) 18.50. 

1.50 (a) 3; (b) 4; (c) 7; (d) 3; (e) 8; (/) unlimited; (g) 3; (A) 3; (i) 4; (./) 5. 

1.51 (a) 0.005 million bu, or 5000 bu; three, (b) 0.000000005 cm, or 5 x 10~ 9 cm; four, (c) 0.5 ft; four. 

(d) 0.05 x 10 s m, or 5 x 10 6 m; two. (e) 0.5 mi/sec; six. (/) 0.5 thousand mi/sec, or 500mi/sec; three. 

1.52 (a) 3.17 x 10~ 4 ; (b) 4.280 x 10 8 ; (c) 2.160000 x 10 4 ; (d) 9.810 x 10~ 6 ; (e) 7.32 x 10 5 ; (/) 1.80 x 10" 3 . 

1.53 (a) 374; (b) 14.0. 

1.54 (a) 280 (two significant figures), 2.8 hundred, or 2.8 x 10 2 ; (b) 178.9; 

(c) 250,000 (three significant figures), 250 thousand, or 2.50 x 10 5 ; (d) 53.0; (e) 5.461; (/) 9.05; 

(g) 11.54; (/?) 5,745,000 (four significant figures), 5745 thousand, 5.745 million, or 5.745 x 10 6 ; (0 1.2; 

0)4157. 
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-11; (ft) 2; (c) f, or 4.375; (d) 21; (e) 3; (/) -16; (g) -M, or 9.89961 approximately; 
h) -7/\/34, or -1.20049 approximately; (/) 32; (J) 10/VT7, or 2.42536 approximately. 

» 22, 18, 14, 10, 6, 2, -2, -6, and -10; (ft) 19.6, 16.4, 13.2, 2.8, -0.8, -4, and -8.4; 

1.2, 30, 10 - A\fl = 4.34 approximately, and 10 + An = 22.57 approximately; 
d)3, 1, 5, 2.1, -1.5, 2.5, and 0; (e) X = \(W - Y). 

a) -5; (ft) -24; (c) 8. 
» -8; (ft) 4; (c) -16. 

4; (ft) 2; (c) 5; (d) |; (e) 1; (/) -7. 
» a = 3, ft = 4; (A) a = -2, b = 6; (c) X = -0.2, Y = -1.2; 

,d) A = ifl = 26.28571 approximately, 5 = ^ = 15.71429 approximately; (e) a = 2, b = 3, c = 
(/) X = —1, F = 3, Z = -2; (g) [/ = 0.4, L = -0.8, = 0.3. 

b) (2, -3); i.e., X = 2, Y = —3. 

» 2, -2.5; (b) 2.1 and -0.8 approximately. 



b) 2 and -2.5. 

c) 0.549 and -2.549 approximately. 

,_-8±V=36 -8±736V=n" -8 ±6/ .... , . ^-r 
= = — - — = -4 ± 3i, where ; = V- 1 . 

These roots are complex numbers and will not show up when a graphic procedure is employed. 
» -6.15 < -4.3 < -1.5 < 1.52 < 2.37; (b) 2.37 > 1.52 > -1.5 > -4.3 > -6.15. 
» 30 < N < 50; (Z>) S > 7; (c) -4 < X < 3; (c/) P < 5; (e) X - Y > 2. 

» X > 4; (ft) X > 3; (c) N < 5; (d) Y < 1; (e) -8 < X < 7; (/) -1.8 < N < 3; (g) 2 < a < 22. 
» l;(ft) 2; (c) 3; (d) -1; (e) -2. 

1.0000; (ft) 2.3026; (c) 4.6052; (rf) 6.9076; (e) -2.3026. 
» 1; (ft) 2; (c) 3; (d) 4; (e) 5. 

1.87 The EXCEL command is shown below the answer. 

1.160964 1.974636 2.9974102 1.068622 1.056642 
=LOG(5,4) =LOG(24,5) =LOG(215,6) =LOG ( 8 , 7 ) =LOG ( 9 , 8 ) 

1.88 > evalf (log[4] (5) ) ; 1.160964047 

> evalf (log[5] (24) ) ; 1.974635869 

> evalf (log [6] (215) ) ; 2.997410155 

> evalf (log[7] (8) ) ; 1.068621561 
> evalf (log[8] (9) ) ; 1.056641667 



= log(x) + logCv) + log(z) - 31og(w) 
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1.91 



51n(a) - 41n(Z>) + ln(c) + ln(d) = In . 
\og(u) + log(v) + log(w) - 21og(x) - 31og(j>) - 41og(z) = log | 





1.92 



1.93 



104/3. 



1.94 



2, -5/3. 



1.95 




1.96 



165.13. 



1.97 471.71. 



1.98 402.14. 



1.99 2.363. 



1.100 0.617. 



CHAPTER 2 

2.19 (b) 62. 

2.20 (a) 799; (b) 1000; (c) 949.5; (d) 1099.5 and 1199.5; (e) 100 (hours); (/) 76; 
(g) m = °- 155 > or 15 - 5% ; (*) 29 - 5% ; (0 19-0%; 0) 78.0%. 

2.25 (a) 24%; (6) 11%; (c) 46%. 

2.26 (a) 0.003 in; (b) 0.3195, 0.3225, 0.3255, . . . , 0.3375 in. 

(c) 0.320-0.322, 0.323-0.325, 0.326-0.328, . . ., 0.335-0.337 in. 

2.31 (a) Each is 5 years; (b) four (although strictly speaking the last class has no specified size); (c) one; 

(d) (85-94); (e) 7 years and 17 years; (/) 14.5 years and 19.5 years; (g) 49.3% and 87.3%; (h) 45.1%; 
(0 cannot be determined. 

2.33 19.3, 19.3, 19.1, 18.6, 17.5, 19.1,21.5,22.5,20.7, 18.3, 14.0, 11.4, 10.1, 18.6, 1 1.4, and 3.7. (These will not add 
to 265 million because of the rounding errors in the percentages.) 

2.34 (b) 0.295; (c) 0.19; (d) 0. 



CHAPTER 3 

3.47 (a) X, + X 2 + X 3 + Z 4 + 8 

(b) x\ + ~f 2 x\ + ux\ + uxl +f 5 x 2 5 

(c) [/,([/, + 6) + U 2 (U 2 + 6) + £/ 3 ([/ 3 + 6) 

(d) y 2 +yI + ---+y 2 n - an 

(e) 4X X Yi+4Y 2 Y 2 + 4X 3 Y 3 + 4X 4 Y 4 . 



3.48 (a)^(X ; + 3) 3 ; (6) - a) 2 ; (c) £ (2X J - 3F ; ); 
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3.51 (a) 20; (6) -37; (c) 53; (</) 6; (e) 226; (/) -62; fc) ff. 

3.52 (a) -1; (7>) 23. 

3.53 86. 

3.54 0.50 second. 

3.55 8.25. 

3.56 (a) 82; (b) 79. 

3.57 78. 

3.58 66.7% males and 33.3% females. 

3.59 11.09 tons. 

3.60 501.0. 

3.61 0.72642 cm. 

3.62 26.2. 

3.63 715 minutes. 

3.64 (b) 1.7349 cm. 

3.65 (a) Mean = 5.4, median = 5; (b) mean = 19.91, median = 19.85. 

3.66 85. 

3.67 0.51 second. 

3.68 8. 

3.69 11.07 tons. 

3.70 490.6. 

3.71 0.72638 cm. 

3.72 25.4. 

3.73 Approximately 78.3 years. 

3.74 35.7 years. 

3.75 708.3 minutes. 
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3.76 (a) Mean = 8.9, median = 9, mode = 7. 

(ft) Mean = 6.4, median = 6. Since each of the numbers 4, 5, 6, 8, and 10 occurs twice, we can consider these 
to be the five modes; however, it is more reasonable to conclude in this case that no mode exists. 

3.77 It does not exist. 

3.78 0.53 second. 

3.79 10. 

3.80 11.06 tons. 

3.81 462. 

3.82 0.72632 cm. 

3.83 23.5. 

3.84 668.7 minutes. 

3.85 (a) 35-39; (ft) 75 to 84. 

3.86 (a) Using formula (9), mode 

(b) Using formula (9), mode 

(c) Using formula (9), mode 
(<•/) Using formula (9), mode 

3.88 (a) 8.4; (b) 4.23. 

3.89 (a) G = 8; (ft) X = 12.4. 

3.90 (a) 4.14; (ft) 45.8. 

3.91 (a) 11.07 tons; (ft) 499.5. 

3.92 18.9%. 

3.93 (a) 1.01%; (ft) 238.2 million; (c) 276.9 million. 

3.94 $1586.87. 

3.95 $1608.44. 

3.96 3.6 and 14.4. 

3.97 (a) 3.0; (ft) 4.48. 

3.98 (a) 3; (ft) 0; (c) 0. 

3.100 (a) 11.04; (ft) 498.2. 

3.101 38.3 mi/h. 

3.102 (ft) 420 mi/h. 



= 11.1 Using formula (10), mode = 1 1 .03 

= 0.7264 Using formula (10), mode = 0.7263 

= 23.5 Using formula (10), mode = 23.8 

= 668.7 Using formula (10), mode = 694.9. 
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3.104 (a) 25; (b) 3.55. 

3.107 (a) Lower quartile = Q\ = 67, middle quartile = Q 2 = median = 75, and upper quartile = g 3 = 83. 

(b) 25% scored 67 or lower (or 75% scored 67 or higher), 50% scored 75 or lower (or 50% scored 75 or 
higher), and 75% scored 83 or lower (or 25% scored 83 or higher). 

3.108 (a) g, = 10.55 tons, Q 2 =11.07 tons, and Q 3 = 11.57 tons; (b) Q x = 469.3, Q 2 = 490.6, and g 3 = 523.3. 

3.109 Arithmetic mean, Median, Mode, Q 2 , P 50 , and D 5 . 

3.110 (a) 10.15 tons; (b) 11.78 tons; (c) 10.55 tons; (d) 11.57 tons. 
3.112 (a) 83; (b) 64. 

CHAPTER 4 

4.33 (a) 9; (b) 4.273. 

4.34 4.0 tons. 

4.35 0.0036 cm. 

4.36 7.88 kg. 

4.37 20 weeks. 

4.38 (a) 18.2; (b) 3.58; (c) 6.21; (d) 0; (e) sft = 1.414 approximately; (/) 1.88. 

4.39 (a) 2; (b) 0.85. 

4.40 (a) 2.2; (Z>) 1.317. 

4.41 0.576 ton. 

4.42 (a) 0.00437cm; (b) 60.0%, 85.2%, and 96.4%. 

4.43 (a) 3.0; (b) 2.8. 

4.44 (a) 31.2; (b) 30.6. 

4.45 (a) 6.0; (b) 6.0. 

4.46 4.21 weeks. 

4.48 (a) 0.51 ton; (b) 27.0; (c) 12. 

4.49 3.5 weeks. 

4.52 (a) 1.63 tons; (b) 33.6 or 34. 

4.53 The 10-90 percentile range equals $189,500 and 80% of the selling prices are in the range 
$130,250 ± $94,750. 

4.56 (a) 2.16; (b) 0.90; (c) 0.484. 
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4.58 45. 

4.59 (a) 0.733 ton; (b) 38.60; (c) 12.1. 

4.61 (a) X= 2.47; (/>).?= 1.11. 

4.62 .? = 5.2 and Range/4 = 5. 

4.63 (a) 0.00576cm; (/>) 72.1%, 93.3%, and 99.76%. 

4.64 (a) 0.719 ton; (b) 38.24; (c) 11.8. 

4.65 (a) 0.000569cm; (b) 71.6%, 93.0%, and 99.68%. 

4.66 (a) 146.81b and 12.91b. 

4.67 (a) 1 .7349 cm and 0.00495 cm. 

4.74 (a) 15; (b) 12. 

4.75 (a) Statistics; (b) algebra. 

4.76 (a) 6.6%; (b) 19.0%. 

4.77 0.15. 

4.78 0.20. 

4.79 Algebra. 

4.80 0.19, -1.75, 1.17, 0.68, -0.29. 



CHAPTER 5 

5.15 (a) 6; (b) 40; (c) 288; (rf) 2188. 

5.16 (a) 0; (/>) 4; (c) 0; (rf) 25.86. 

5.17 (a) -1; (/>) 5; (c) -91; (</) 53. 
5.19 0,26.25,0,1193.1. 

5.21 7. 

5.22 (a) 0, 6, 19, 42; (6) -4, 22, -117, 560; (c) 1, 7, 38, 155. 

5.23 0, 0.2344, -0.0586, 0.0696. 

5.25 (a) m x = 0; (ft) m 2 =pq; (c) m 3 = pq{q - p); (d) m 4 = pq(p 2 -pq + q 1 ). 

5.27 m, = 0, m 2 = 5.97, m 3 = -0.397, m 4 = 89.22. 

5.29 w, (corrected) = 0, m 2 (corrected) = 5.440, m 3 (corrected) = -0.5920, m 4 (corrected) = 76.2332. 
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5.30 (a) mi = 0, m 2 = 0.53743, m 3 = 0.36206, m 4 = 0.84914; 
(b) m 2 (corrected) = 0.51660, m 4 (corrected) = 0.78378. 

5.31 (a) 0; (b) 52.95; (c) 92.35; (d) 7158.20; (e) 26.2; (/) 7.28; (g) 739.58; (h) 22,247; (/) 706,428; 
(j) 24,545. 

5.32 (a) -0.2464; (b) -0.2464. 

5.33 0.9190. 

5.34 First distribution. 

5.35 (a) 0.040; (b) 0.074. 

5.36 (a) -0.02; (b) -0.13. 



5.37 



Pearson's coefficient of skewness 



First coefficient 
Second coefficient 



1.094 







-0.770 
-1.094 



5.38 (a) 2.62; (b) 2.58. 

5.39 (a) 2.94; (b) 2.94. 

5.40 (a) Second; (b) first. 

5.41 (a) Second; (b) neither; (c) first. 

5.42 (a) Greater than 1875; (7>) equal to 1875; (c) less than 1875. 

5.43 (a) 0.313. 



CHAPTER 6 

6.40 (a) !; (b) f 6 ; (c) 0.98; (</) f; (e) |. 

6.41 (a) Probability of a king on the first draw and no king on the second draw. 

(b) Probability of either a king on the first draw or a king on the second draw, or both. 

(c) No king on the first draw or no king on the second draw, or both (i.e., no king on the first and second 
draws). 

(d) Probability of a king on the third draw, given that a king was drawn on the first draw but not on the 
second draw. 

(<?) No king on the first, second, and third draws. 

(J) Probability either of a king on the first draw and a king on the second draw or of no king on the second 
draw and a king on the third draw. 

6.42 (a) i; (b) |; (c) % (d) f; (e) % 

6.43 (a) « ; (b) A- ( c ) if; (d) « ; (e) jl; (/) i; (g) %; (h) |i; (0 & (j) §- y 

6.44 (a) (b) h (c) li; id) #; ( c ) ji; (/) i; (g) f ; (A) |§; (/) (j) ft. 
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6.45 (a) & (ft) il ; (C ) ±. 

6.46 (a) g; (ft) (c) £ (</) $ (e) ^; (/) {§; (g) $ (A) ^ . 

6.47 ^. 

6.48 (a) 81:44; (A) 21:4. 

6.49 |f. 

6.50 (a) §• (ft) i; (c) ±; (</) ||. 

6.51 (a) 37.5%; (ft) 93.75%; (c) 6.25%; (</) 68.75%. 



6.52 (a) 


X 





2 


3 


4 




P(X) 











6.53 (a) i; (ft) (c) |; (rf) i. 





12 3 







6.55 (a) 





3 


4 






7 






10 


11 




13 


14 




16 


17 


18 




1 


3 


6 


10 


15 


21 


25 


27 


27 


25 


21 


15 


10 


6 


3 





*A11 the p(x) values have a divisor of 216. 



(ft) 0.532407. 

6.56 $9. 

6.57 $4.80 per day. 

6.58 A contributes $12.50; B contributes $7.50.c 

6.59 (a) 7; (ft) 590; (c) 541; (d) 10,900. 

6.60 (a) 1.2; (ft) 0.56; (c) %/CL56 = 0.75 approximately. 

6.63 10.5. 

6.64 (a) 12; (ft) 2520; (c) 720; 

(d) =PERMUT (4,2), =PERMUT (7,5), =PERMUT (10,3). 

6.65 n=5. 

6.66 60. 

6.67 (a) 5040; (ft) 720; (c) 240. 

6.68 (a) 8400; (ft) 2520. 

6.69 (a) 32,805; (ft) 11,664. 
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6.70 26. 

6.71 (a) 120; (ft) 72; (c) 12. 

6.72 (a) 35; (ft) 70; (c) 45; 

(d) =COMBIN(7,3) , =COMBIN(8,4) , =COMBIN ( 10 , 8 ). 

6.73 n = 6. 

6.74 210. 

6.75 840. 

6.76 (a) 42,000; (ft) 7000. 

6.77 (a) 120; (ft) 12,600. 

6.78 (a) 150; (ft) 45; (c) 100. 

6.79 (a) 17; (ft) 163. 
6.81 2.95 x 10 25 . 

6.83 (a) j^; (ft) jg; (c) ±g; (d) 3^. 

6.84 ^. 

6.85 (a) 0.59049; (ft) 0.32805; (c) 0.08866. 

6.86 (ft) |; (c) |. 

6.87 (a) 8; (ft) 78; (c) 86; (d) 102; (e) 20; (J) 142. 

6.90 i. 

6.91 1/3,838,380 (i.e., the odds against winning are 3,838,379 to 1). 

6.92 (a) 658,007 to 1; (ft) 91,389 to 1; (c) 9879 to 1. 

6.93 (a) 649,739 to 1; (ft) 71,192 to 1; (c) 4164 to 1; (d) 693 to 1. 

6.94 11. 

6.95 1 



6.96 



X 


3 


4 


5 


6 


7 


8 


9 


10 


11 


12 


P (xy 


1 


3 


6 


10 


12 


12 


10 


5 


3 


1 



*A11 the p(x) values have a divisor of 64. 



6.97 7.5. 



6.98 70%. 
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6.99 (0.5)(0.01) + (0.3)(0.02) + (0.2)(0.03) = 0.017. 



CHAPTER 7 

7.35 (a) 5040; (b) 210; (c) 126; (d) 165; (e) 6. 

7.36 (a) q 1 + lq 6 p + 21a 5 / + 35a 4 / + 35a 3 / + 21a 2 / + Iqp 6 + / 

(b) q m + 10a 9 /; + 45a 8 / + 120a 7 / + 210c// + 252a 5 / + 210a 4 / + 120a 3 / + 45a 2 / + 10a/ + / 

7.37 (a) i; (b) (c) if; (J) ±; (e) i;(/)|;fe)i; 
(//) Probability Density Function 

Binomial withn=6 andp = 0.5 



X P (X = x ) 

0.015625 

1 0.093750 

2 0.234375 

3 0.312500 

4 0.234375 

5 0.093750 

6 0.015625 



7.38 (a)g;(6)|; 

(c) l-BINOMDIST( 1,6, 0.5,1) or 0.89062 5, =BINOMDIST ( 3 , 6 , . 5 , 1 ) = 0.6562 5. 

7.39 (a)J;(*)&;(c)&;(d)|. 

7.40 (a) 250 ; (Z>) 25 ; (c) 500. 

7.41 (fl)^;(A)^. 



7.44 (a )£;(b)^;(c)£ 3 ;(d)%; 
(e) 

a 0.131691 

i> 0.790128 

c 0.164606 

d 0.995885 

7.45 
7.47 



=BINOMDIST (5,5,0. 66667 , ) 
=l-BINOMDIST( 2, 5, 0.66667,1) 
=BINOMDIST (2,5,0. 66667 , ) 
= l-BINOMDIST( 0,5, 0.66667,0). 



(a) 42; (b) 3.550; (c) -0.1127; (d) 2.927. 
(a) Npq{q-p); (b) Npq{\ - 6pq) + lN 2 p 2 q 2 . 
(a) 1.5 and -1.6; (b) 72 and 90. 



7.50 (a) 75.4; (b) 9. 
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7.51 (a) 0.8767; (ft) 0.0786; (c) 0.2991; 



0.8767328 
0.0786066 
0.2991508 



=NORMSDIST (2.4 )-NORMSDIST (-1.2) 
=NORMSDIST ( 1 . 87 )-NORMSDIST (1.23) 
=NORMSDIST (-0.5 )-NORMSDIST (-2.35). 



7.52 (a) 0.0375; (ft) 0.7123; (c) 0.9265; (d) 0.0154; (e) 0.7251; if) 0.0395; 
(g) 

a 0.037538 =NORMSDIST ( -1 . 78 ) 

b 0.7122603 =NORMSDIST(0.56) 

c 0.9264707 =l-NORMSDIST ( -1 . 45 ) 

d 0.0153863 =l-NORMSDIST ( 2 . 16 ) 

e 0.7251362 =NORMSDIST ( 1 . 53 ) -NORMSDIST ( -0 . 8 ) 

f 0.0394927 =NORMSDIST ( -2 . 52 ) + ( 1-NORMSDIST ( 1 . 83 ) ). 



7.53 (a) 0.9495; (ft) 0.9500; (c) 0.6826. 

7.54 (a) 0.75; (ft) -1.86; (c) 2.08; (rf) 1.625 or 0.849; (e) ±1.645. 

7.55 -0.995. 

7.56 (a) 0.0317; (b) 0.3790; (c) 0.1989; 
(d) 

a 0.03174 =NORMDIST(2. 25, 0,1,0) 

£> 0.37903 =NORMDIST(-0. 32, 0,1,0) 

c 0.19886 =NORMDIST(-l. 18,0, 1,0). 

7.57 (a) 4.78%; (ft) 25.25%; (c) 58.89%. 

7.58 (a) 2.28%; (ft) 68.27%; (c) 0.14%. 

7.59 84. 

7.60 (a) 61.7%; (ft) 54.7%. 

7.61 (a) 95.4%; (ft) 23.0%; (c) 93.3%. 

7.62 (a) 1.15; (ft) 0.77. 

7.63 (a) 0.9962; (ft) 0.0687; (c) 0.0286; (d) 0.0558. 

7.64 (a) 0.2511; (ft) 0.1342. 

7.65 (a) 0.0567; (ft) 0.9198; (c) 0.6404; (rf) 0.0079. 

7.66 0.0089. 

7.67 (a) 0.04979; (ft) 0.1494; (c) 0.2241; (rf) 0.2241; (e) 0.1680; (/) 0.1008. 

7.68 (a) 0.0838; (ft) 0.5976; (c) 0.4232. 

7.69 (a) 0.05610; (ft) 0.06131. 

7.70 (a) 0.00248; (ft) 0.04462; (c) 0.1607; (d) 0.1033; (e) 0.6964; (/) 0.0620. 
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7.71 (a) 0.08208; (b) 0.2052; (c) 0.2565; (d) 0.2138; (e) 0.891 1; (J) 0.0142. 

7.72 (a) 3^; (b) & 

7.73 (a) 0.0348; (ft) 0.000295. 

7.74 i. 

7.75 p(X) = ( 4 X )(032) X (0.6&) 4 ~ x . The expected frequencies are 32, 60, 43, 13, and 2, respectively. 
7.76 

Histogram 

15 j 




4 8 12 16 20 

The histogram shows a skew to the data, indicating non-normality. 
The Shapiro-Wilt test of STATISTIX indicates non-normality. 

7.77 The STATISTIX histogram strongly indicates normality. 



Histogram 




Hours 
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The pull-down "Statistics =>■ Randomness/Normality Tests => Normality Probability Plot" leads to 
the following plot. If the data are from a normally distributed population, the points in the plot tend 
to fall along a straight line and P(W) tends to be larger than 0.05. If P(W)<Q.Q5, generally normality 
is rejected. 



Normal probability plot of hours 




Rankits 

Shapiro-Wilk W 0.9768 P(W) 0.7360 30 Ci 



The normal probability plot is shown above. The Shapiro-Wilk statistic along with the />-value 
is shown. P(W) = 0.7360. Since the p-value is considerably larger than 0.05, normality of the data is not 
rejected. 



7.78 The following histogram of the test 
it is called a U-shape distribution. 



n Table 7.11, produced by STATISTIX, has a U-shape. Hence, 



Histogram 




The pull-down "Statistics => Randomness/Normality Tests => Normality Probability Plot" leads to the 
following plot. If the data comes from a normally distributed population, the points in the plot tend to 
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fall along a straight line and P(W) tends to be larger than 0.05. If P(1T)<0.05, generally normality is 
rejected. 



Normal probability plot of scores 




-3-2-10123 
Rankits 

Shapiro-Wilk W 0.8837 P(W) 0.0034 30 cases 



The normal probability plot is shown above. The Shapiro-Wilk statistic along with the />-value is shown. 
P(W) = 0.0034. Since the p-value is less than 0.05, normality of the data is rejected. 

7.79 In addition to the Kolmogorov-Smirnov test in MINITAB and the Shapiro-Wilk test in STATISTIX, there 
are two other tests of normality that we shall discuss that are available for testing for normality. They are 
the Ryan-Joiner and the Anderson-Darling tests. Basically all four calculate a test statistic and each test 
statistic has a p-value associated with it. Generally, the following rule is used. If the />-value is <0.05, 
normality is rejected. The following graphic is given when doing the Anderson-Darling test. In this 
case, the />-value is 0.006 and I would reject the null hypothesis that the data came from a normal 
distribution. 
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Note that if the Ryan-Joiner test is used you do not reject normality. 

Probability plot of scores 
Normal 



c 60 
2 50 
o 40 




7.80 p(X) = 



(0- 6 1)V 



-. The expected frequE 



)8.7, 66.3, 20.2, 4.1, and 0.7, respectively. 



CHAPTER 8 

8.21 (a) 9.0; (b) 4.47; (c) 9.0; (d) 3.16. 

8.22 (a) 9.0; (b) 4.47; (c) 9.0; (d) 2.58. 

8.23 (a) nz = 22.40 g, = 0.008 g; (b) = 22.40 g, a x = slightly less than 0.008 g 

8.24 (a) n x = 22.40 g, a x = 0.008 g; (b) Hx = 22.40 g, a x = 0.0057 g. 

8.25 (a) 237; (A) 2; (c) none; (d) 34. 

8.26 (a) 0.4972; (b) 0.1587; (c) 0.0918; (d) 0.9544. 

8.27 (a) 0.8164; (b) 0.0228; (c) 0.0038; (rf) 1.0000. 

8.28 0.0026. 

8.34 (a) 0.0029; (b) 0.9596; (c) 0. 1446. 

8.35 (a) 2; (A) 996; (c) 218. 

8.36 (a) 0.0179; (b) 0.8664; (c) 0.1841. 

8.37 (a) 6; (A) 9; (c) 2; (rf) 12. 

8.39 (a) 19; (b) 125. 

8.40 (a) 0.0077; (b) 0.8869. 



ANSWERS TO SUPPLEMENTARY PROBLEMS 



521 



8.41 (a) 0.0028; (b) 0.9172. 

8.42 (a) 0.2150; (b) 0.0064, 0.4504. 

8.43 0.0482. 

8.44 0.0188. 

8.45 0.0410. 

8.47 (a) 118.79 g; (b) 0.74 g. 

8.48 0.0228. 

8.49 u. = 12 and a 2 = 10.8. 











first 


second 




probability 


6 


6 


6 


. 01 


6 


9 


7 . 5 


. 02 


6 


12 


9 


0.04 


6 


15 


10.5 


0.02 


6 


18 


12 


0.01 


9 


6 


7.5 


0.02 


9 


9 


9 


0.04 


9 


12 


10.5 


0.08 


9 


15 


12 


0.04 


9 


18 


13.5 


0.02 


12 


6 


9 


0.04 


12 


9 


10.5 


0.08 


12 


12 


12 


0. 16 


12 


15 


13.5 


0.08 


12 


18 


15 


0.04 


15 


6 


10.5 


0.02 


15 


9 


12 


0.04 


15 


12 


13.5 


0.08 


15 


15 


15 


0.04 


15 


18 


16.5 


0.02 


18 


6 


12 


0.01 


18 


9 


13.5 


0.02 


18 


12 


15 


0.04 


18 


15 


16.5 


0.02 


18 


18 


18 


0.01 



8.50 Probability distribution for x-bar with n = 2. 

D E F G H 

probability xbar p(xbar) 

0.01 6 0.01 D2 

0.02 7.5 0.04 D3+D7 
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P(xbar ) 
0.001 
0.006 
0.024 
0.062 



0.26 
0.2 



D4+D8+D12 

D5+D9+D13+D17 

D6+D10+D14+D18+D2 2 

D11+D15+D19+D23 

D16+D20+D24 

D21+D25 

D26 

SUM(G2:G10) 




8.51 Mean(x-bar) = 12 Var (x-bar) = 5.4. 
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10 0.123 

11 0.18 

12 0.208 

13 0.18 

14 0.123 

15 0.062 

16 0.024 

17 0.006 

18 0.001 



0.25 




CHAPTER 9 

9.21 (a) 9.5 kg; (b) 0.74 kg 2 ; (c) 0.78 kg and 0.86 kg, respectively. 

9.22 (a) 1200 h; (b) 105.4 h. 

9.23 (a) Estimates of population standard deviations for sample sizes of 30, 50, and 100 tubes are 101.7 h, 101.0 h, 

and 100.5 h, respectively; estimates of population means are 1200 h in all cases. 

9.24 (a) 11.09 ±0.18 tons; (b) 1 1.09 ± 0.24 tons. 

9.25 (a) 0.72642 ± 0.000095 in; (b) 0.72642 ± 0.000085 in; (c) 0.72642 ± 0.000072 in; (d) 0.72642 ± 0.000060 in. 

9.26 (a) 0.72642 ± 0.000025 in; (b) 0.000025 in. 

9.27 (a) At least 97; (b) at least 68; (c) at least 167; (d) at least 225. 

9.28 80% CI for mean: (286.064, 322.856). 

9.29 (a) 2400 ± 45 lb, 2400 ± 59 lb; (b) 87.6%. 
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9.31 97.5% CI for difference: (0.352477, 0.421523). 

9.32 (a) 16,400; (b) 27,100; (c) 38,420; (d) 66,000. 

9.33 (a) 1.07 ± 0.09 h; (b) 1.07 ± 0.12 h. 

9.34 85% CI for difference: (-7.99550, -0.20948). 

9.35 95% CI for difference: (-0.0918959, -0.00610414). 

9.36 (a) 180 ± 24.9 lb; (b) 180 ± 32.8 lb; (c) 180 ± 38.2 lb. 
9.37 

One Sample Chi-square Test for a Variance 
Sample Statistics for volume 

N Mean Std. Dev. Variance 

20 180.65 1.4677 2.1542 

99% Confidence Interval for the variance 
Lower Limit Upper Limit 
1.06085 5.98045 

9.38 

Two Sample Test for variances of units within line 
Sample Statistics 
line 

Group N Mean Std. Dev. Varinace 

1 13 104.9231 12.189 148.5769 

2 15 101.2667 5.5737 31.06667 
95% Confidence Interval of the Ratio of Two Variances 

Lower Limit Upper Limit 

1.568 15.334 



CHAPTER 10 
10.29 (a) 0.2604. 

(b) 0.1302 



a = 0.1302 + 0.1302. 
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10.30 (a) Reject the null hypothesis if X< 21 or X> 43, where X= number of red marbles drawn; (b) 0.99186; 
(c) Reject ifX<23 or X>41. 

10.31 (a) H Q . p = 0.5 H. d : p>0.5; (b) One-tailed test; (c) Reject the null hypothesis if Z>40; (d) Reject the null 

hypothesis if X> 41. 

10.32 (a) Two-tailed />-value=2* ( 1-BINOMDIST ( 22 , 100, 0. 16666, 1) = 0.126>0.05. Do not 

reject the null at the 0.05 level. 
(b) One-tailed p-value = 1-BINOMDIST (22, 100 , 0. 16 6 6 6, 1) = 0.063 > 0.05. Do not reject 
the null at the 0.05 level. 

10.33 Using either a two-tailed or a one-tailed test, we cannot reject the hypothesis at the 0.01 level. 

10.34 H Q :p> 0.95 H. d : p<0.95 

Rvalue = P{X< 182 from 200 pieces when /? = 0.95} =BINOMDIST (182, 200,0. 95, 1)= 0.0 12. 
Fail to reject at 0.01 but reject at 0.05. 

10.35 Test statistic = 2.63, critical values ±1.7805, reject the null hypothesis. 

10.36 Test statistic = -3.39, 0.10 critical value = -1.28155, 0.025 critical value = -1.96. The result is significant at 
a = 0.10 and a = 0.025. 

10.37 Test statistic = 6.46, critical value= 1.8808, conclude that u>25.5. 

10.38 =NORMSINV(0.9) = 1.2815 for a = 0.1, =NORMSINV ( . 99 ) =2.3263 for a = 0.01 and 
=NORMSINV (0 . 9 9 9) =3.0902 for a = 0.001. 

10.39 p-value = P{X< 3} + P{X> 12} = 0.0352. 

10.40 rvalue = P{Z< -2.63} + P{Z> 2.63} = 0.0085. 

10.41 rvalue = P{Z< -3.39} = 0.00035. 

10.42 p-value = P{Z> 6.46} = 5.235 1 5E- 1 1 . 

10.43 (a) 8.64 ± 0.96 oz; (b) 8.64 ± 0.83 oz; (c) 8.64 ± 0.63 oz. 

10.44 The upper control limits are (a) 12 and (b) 10. 

10.45 Test statistic = -5.59, /7-value = 0.000. Reject the null since the p-value<a. 

10.46 Test statistic = -1. 58, />-value = 0.059. Do not reject for a = 0.05, do reject for a = 0.10. 

10.47 Test statistic = -1. 73, p- value = 0.042. Reject for a = 0.05, do not reject for a = 0.01. 

10.48 A one-tailed test shows that the new fertilizer is superior at both levels of significance. 

10.49 (a) Test statistic = 1.35, /?-value = 0.176, unable to reject the null at a = 0.05. 
(b) Test statistic = 1.35, Rvalue = 0.088, unable to reject the null at a = 0.05. 

10.50 (a) Test statistic = 1.81, /?-value = 0.07, unable to reject the null at a = 0.05. 
(b) Test statistic = 1.81, /?-value = 0.0035, reject the null at a = 0.05. 

10.51 =1-BINOMDIST(10,15,0.5,1) or 0.059235. 
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10.52 =BINOMDIST(2,2 0,0.5,1) + 1-BINOMDIST ( 17 , 20 , . 5 , 1 ) or 0.000402. 

10.53 =BINOMDIST (10,15,0.6,1) or 0.7827. 

10.54 =BINOMDIST(17,20,0.9,1)-BINOMDIST(2,20,0.9,1) or 0.3231. 

10.55 p-value is given by =l-BINOMDIST (9,15,0.5,1) which equals 0.1509. Do not reject the null 
because the value of a = 0.0592 and the p-value is not less then a. 

10.56 a=BINOMDIST(4, 20, 0.5,1) +1-BINOMDIST ( 15 , 20 , . 5 , 1 ) or0.0118 

p-value =BINOMDIST(3,20,0.5,1) +1-BINOMDIST ( 16 , 20 , . 5 , 1 ) or 0.0026. 
Reject the null because the /j-value < a. 

10.57 =1-BINOMDIST(3,30,0.03,1) or 0.0119. 

10.58 =BINOMDIST(3,30,0.04,1) or 0.9694. 

10.59 a = l-BINOMDIST (5, 20,0. 16667, 1) or 0.1018 
rvalue =l-BINOMDIST (6,20,0. 16667 , 1 ) or 0.0371. 

CHAPTER 11 

11.20 (a) 2.60; (b) 1.75; (c) 1.34; (d) 2.95; (e) 2.13. 

11.21 (a) 3.75; (b) 2.68; (c) 2.48; (d) 2.39; (e) 2.33. 

(a)=TTNV(0.02,4) or 3.7469; (ft) 2.6810; (c) 2.4851; (rf) 2.3901; (e) 3.3515. 

11.22 (a) 1.71; (b) 2.09; (c) 4.03; (d) -0.128. 

11.23 (a) 1.81; (b) 2.76; (c) -0.879; (d) -1.37. 

11.24 (a) ±4.60; (b) ±3.06; (c) ±2.79; (d) ±2.75; (e) ±2.70. 

11.25 (a) 7.38 ±0.79; (6)7.38 ±1.11; 

(c) (6.59214, 8.16786) (6.26825, 8.49175). 

11.26 (a) 7.38 ± 0.70; (b) 7.38 ± 0.92. 

11.27 (a) 0.298 ± 0.030 second; (b) 0.298 ± 0.049 second. 

11.28 A two-tailed test shows that there is no evidence at either the 0.05 or 0.01 level to indicate that the mean 
lifetime has changed. 

11.29 A one-tailed test shows no decrease in the mean at either the 0.05 or 0.01 level. 

11.30 A two-tailed test at both levels shows that the product does not meet the required specifications. 

11.31 A one-tailed test at both levels shows that the mean copper content is higher than the specifications require. 

11.32 A one-tailed test shows that the process should not be introduced if the significance level adopted is 0.01 but 
it should be introduced if the significance level adopted is 0.05. 

11.33 A one-tailed test shows that brand A is better than brand B at the 0.05 significance level. 
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11.34 Using a two-tailed test at the 0.05 significance level, we would not conclude on the basis of the samples that 
there is a difference in acidity between the two types. 

11.35 Using a one-tailed test at the 0.05 significance level, we would conclude that the first group is not superior to 
the second. 

11.36 (a) 21.0; (b) 26.2; (c) 23.3; (rf)=CHIINV( 0.05, 12) or 21.0261 =CHIINV ( . 01 , 12 ) or 26.2170 

=CHIINV( 0.025, 12) or 23.3367. 

11.37 (a) 15.5; (b) 30.1; (c) 41.3; (d) 55.8. 

11.38 (a) 20.1; (b) 36.2; (c) 48.3; (d) 63.7. 

11.39 (a) X 2 \ = 9.59, and X 2 = 34.2. 

11.40 (a) 16.0; (b) 6.35; (c) assuming equal areas in the two tails, xi = 2.17 and xl = 14.1. 

11.41 (a) 87.0 to 230.9 h; (b) 78.1 to 288.5 h. 

11.42 (a) 95.6 to 170.4 h; (b) 88.9 to 190.8 h. 

11.43 (a) 122.5; (b) 179.2; (c) =CHIINV (0.95, 150) or 122.6918; (d) =CHIINV ( . 05 , 150 ) or 

179.5806. 

11.44 (a) 207.7; (b) 295.2; (c) =CHIINV (0.975,250) or 208.0978; (a) =CHIINV ( . 025 , 250 ) or 

295.6886. 

11.46 (a) 106.1 to 140.5 h; (b) 102.1 to 148.1 h. 

11.47 105.5 to 139.6 h. 

11.48 On the basis of the given sample, the apparent increase in variability is not significant at either level. 

11.49 The apparent decrease in variability is significant at the 0.05 level, but not at the 0.01 level. 

11.50 (a) 3.07; (b) 4.02; (c) 2.11; (d) 2.83. 

11.51 (a) =FINV(0.05,8, 10) or 3.0717; (b) =FINV ( . 01 , 24 , 11 ) or 4.0209; 
(c)=FINV(0. 05, 15,24) or 2.1077; (d) =FINV ( . 01 , 20 , 22 ) or 2.8274. 

11.52 The sample 1 variance is significantly greater at the 0.05 level, but not at the 0.01 level. 

11.53 (a) Yes; (b) no. 

CHAPTER 12 

12.26 The hypothesis cannot be rejected at either level. 

12.27 The conclusion is the same as before. 

12.28 The new instructor is not following the grade pattern of the others. (The fact that the grades happen to be 
better than average may be due to better teaching ability or lower standards, or both.) 
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12.29 There is no reason to reject the hypothesis that the coins are fair. 

12.30 There is no reason to reject the hypothesis at either level. 

12.31 (a) 10, 60, and 50, respectively; 

(b) the hypothesis that the results are the same as those expected cannot be rejected at the 0.05 level. 

12.32 The difference is significant at the 0.05 level. 

12.33 (a) The fit is good; (b) no. 

12.34 (a) The fit is "too good"; (b) the fit is poor at the 0.05 level. 

12.35 (a) The fit is very poor at the 0.05 level; since the binomial distribution gives a good fit of the data, this is 

consistent with Problem 12.33. 
(b) The fit is good, but not "too good." 

12.36 The hypothesis can be rejected at the 0.05 level, but not at the 0.01 level. 

12.37 The conclusion is the same as before. 

12.38 The hypothesis cannot be rejected at either level. 

12.39 The hypothesis cannot be rejected at the 0.05 level. 

12.40 The hypothesis can be rejected at both levels. 

12.41 The hypothesis can be rejected at both levels. 

12.42 The hypothesis cannot be rejected at either level. 

12.49 (a) 0.3863 (unconnected), and 0.3779 (with Yates' correction). 

12.50 (a) 0.2205, 0.1985 (corrected); (b) 0.0872, 0.0738 (corrected). 

12.51 0.4651. 

12.54 (a) 0.4188, 0.4082 (corrected). 

12.55 (a) 0.2261, 0.2026 (corrected); (b) 0.0875, 0.0740 (corrected). 

12.56 0.3715. 

CHAPTER 13 

13.24 (a) 4; (b) 6; (c) f ; (d) 10.5; (e) 6; (/) 9. 

13.25 (2, 1). 

13.26 (a) 2X + Y = 4; (b) X intercept = 2, Y intercept = 4; (c) -2, -6. 

13.27 Y = \X-1, or2X-3F = 9. 
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13.28 (a) Slope = f, Y intercept = —4; (ft) 3X — 5Y = 11. 

13.29 (a) - f, (ft) f ; (c) 4X + 3 Y = 32. 

13.30 X/3+ F/(-5) = 1, or 5X-3Y = 15. 

13.31 (a) °F = § °C + 32; (ft) 176°F; (c) 20°C. 

13.32 (a) Y = — i + f JST, or T = -0.333 + 0.714X; (ft) X = 1 + f y, or X = 1.00+ 1.29 Y. 

13.33 (a) 3.24; 8.24; (b) 10.00. 

13.35 (ft) 7 = 29.13 + 0.6611"; (c) 1" = -14.39 + 1.15 F; (d) 79; (e) 95. 

13.36 (a) and (ft) 

Trend analysis plot for birthrate 
Linear trend model 
V = 14.3714-0.0571429*? 

14.4 

14.3 
iS 14.2 
m 14.1 

14.0 

13.9 

1 2 3 4 5 6 7 





14.3143 
14.2571 
14.2000 
14.1429 
14.0857 
14.0286 
13.9714 



Residual 

-0.014286 
-0.057143 
0.200000 
-0.042857 
-0.185714 
0.071429 
0.028571 



(d) 14.3714-0.0571429(13)= 13.6. 
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13.37 (a) and (b) 



Trend analysis plot for number 
Linear trend model 
V ,= 3951 .43 +156.357*? 




Residual 



1 999 
2000 
200 1 
2002 
2003 
2004 
2005 



4154 
4240 
4418 
4547 
4716 
4867 
5096 



4107.79 
4264.14 
4420.50 
4576.86 
4733.21 
4889.57 
5045.93 



46.2143 
-24.1429 

-2.5000 
-29.8571 
-17.2143 
-22.5714 

50.0714 



(d) 3951.43 + 156.357(12)= 5827.7. 



13.38 Y = 5.51 + 3.20(X - 3) + 0.733(Z - 3) 2 , or Y = 2.51 - 1.20X + 0.733X 2 . 



13.39 (b)D = 41.77 - 1.096 F + 0.08786 K"; (c) 170 ft, 516 ft. 
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§ ~ 2 
I -3 



1 2 3 4 5 6 7 

Coded-time 



(b) Fitted quadratic plot ( c ) Fitted cubic plot 

Difference =1.215- 3.098 coded-time Difference = 0.61 24 - 1 .321 coded-time 
+ 0.3345 coded-time**2 - 0.4008 coded-time**2 + 0.07539 coded-time**3 




(d) 





Linear Model 


Quadratic Model 


Cubic Model 


Year 


Residual 


Fitted 


Residual 


Fitted 


Residual 


Fitted 


1940 


1.363 


0.863 


-0.715 


1.215 


-0.112 


0.612 


1950 


0.836 


1.736 


0.649 


-1.549 


0.134 


-1.034 


1960 


-0.092 


0.092 


0.943 


-3.643 


0.330 


-3.030 


1970 


-1.919 


1.919 


-0.332 


-5.068 


-0.476 


-4.924 


1980 


-2.047 


2.047 


-0.575 


-5.825 


-0.139 


-6.261 


1990 


-1.074 


1.074 


-0.388 


-5.912 


0.292 


-6.592 


2000 


0.798 


6.098 


0.030 


-5.330 


0.162 


-5.462 


2005 


2.134 


6.534 


0.388 


-4.788 


-0.191 


-4.209 




SSQ = 


16.782 


SSQ 


= 2.565 


SSQ 


= 0.533 



0) 

Linear: -0.863 - 0.8725(7) = -6.97 

Quadratic: 1.215 - 3.098(7) + 0.3345(7 2 ) = - 4.08 

Cubic: 0.6124 - 1.321(7) - 0.4008(49) + 0.0754(343) = -2.41. 
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13.41 (b) Ratio = 0.965 + 0.0148 Year-coded. 



Fitted 



60.64 
65.61 



0.98 
0.99 



0.98 
0.99 



-0.00 
-0.00 



110.05 
121.24 



116.49 
127.47 



0.00 
-0.02 



!. Actual ratio = 1.04. 



13.42 (b) Difference = -2.63 + 1.35 x + 0.0064 xsquare. 

(d) The predicted difference for 1995 is -2.63 + 1.35(7.5) + 0.0064(56.25) = 7.8> 



13.43 (b) Y = 32.14(1.427)* or Y = 32.14(10) 01544Jr , . 
logarithmic base, 
(rf) 387. 



,x , where e = 2.718 • • • is the natural 



CHAPTER 14 

14.40 (b) Y = 4.000 + 0.500Z; (c) X = 2.408 + 0.612F. 

14.41 (a) 1.304; (b) 1.443. 

14.42 (a) 24.50; (b) 17.00; (c) 7.50. 

14.43 0.5533. 

14.44 The EXCEL solution is =CORREL(A2: All, B2: Bll) or 0.553. 

14.45 1.5. 

14.46 (a) 0.8961; (b) Y = 80.78 + 1.1 38Z; (c) 132. 

14.47 (a) 0.958; (b) 0.872. 

14.48 (a) Y = 0.8^+12; (b) X = 0.45 7+1. 
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14.49 (a) 1.60; (b) 1.20. 

14.50 ±0.80. 

14.51 75%. 

14.53 You get the same answer for both parts, namely -0.9203. 

14.54 (a) Y= 18.04- 1.34X, Y = 51.18 — 2.01X. 

14.58 0.5440. 

14.59 (a) Y = 4.44X - 142.22; (b) 141.9 lb and 177.5 lb, respectively. 

14.60 (a) 16.92 lb; (b) 2.07 in. 

14.62 Pearson correlation of C 1 and C 2 = 0.957. 

14.63 Pearson correlation of C 1 and C 2 = 0.582. 

14.64 (a) Yes; (b) no. 

14.65 (a) No ; (b) yes. 

14.66 (a) 0.2923 and 0.7951; (7j) 0.1763 and 0.8361. 

14.67 (a) 0.3912 and 0.7500; (b) 0.3146 and 0.7861. 

14.68 (a) 0.7096 and 0.9653; (6) 0.4961 and 0.7235. 

14.69 (a) Yes; (ft) no. 

14.70 (a) 2.00 ± 0.21; (/>) 2.00 ± 0.28. 

14.71 (a) Using a one-tailed test, we can reject the hypothesis. 
(b) Using a one-tailed test, we cannot reject the hypothesis. 

14.72 (a) 37.0 ± 3.28; (b) 37.0 ± 4.45. 

14.73 (a) 37.0 ± 0.69; (b) 37.0 ± 0.94. 

14.74 (a) 1.138 ± 0.398; (6) 132.0 ± 16.6; (c) 132.0 ± 5.4. 

CHAPTER 15 

15.26 (a) Jf 3 = b 3A2 + b 3L2 Xt + b 32A X 2 ; (b) X A = b 4A235 + 641.235^1 + 642.135*2 + 643.125*3- 

15.28 (a) Z 3 = 61.40 - 3.65X, + 2.54Z 2 ; (b) 40. 

15.29 (a) Z 3 - 74 = 4.36(Z, - 6.8) + 4.04{X 2 - 7.0), or X 3 = 16.07 + 4.36X, + 4.04Z,; (b) 84 and 66. 

15.30 The output is abridged in all cases. 
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EXCEL 



Multiple R 
R Square 
Adjusted R Squa: 
Standard Error 



Bedrooms 
Baths 



. 877519262 
0.770040055 
0.704337213 
25.62718211 
10 



32.94827586 
28.64655172 
29.28448276 



MINITAB 

Regression Analysis: Price versus Bedrooms, Baths 

The regression equation is 

Price = 32. 9 + 28. 6 Bedrooms + 2 9 . 3 Baths 

R-Sq=77.0% R-Sq(adj) =70.4% 



Root MSE 25.62718 
Dependent Mean 218.20000 
CoeffVar 11.74481 



R-Square 0.7700 
Adj R-Sq 0.7043 



Parameter Estimates 



Variable 
Intercept 
Bedrooms 



Label 

Intercept 

Bedrooms 



32 . 94828 
28.64655 
29.28448 



39.24247 
9.21547 
10.90389 



t Value 
. 84 
3.11 
2.69 



Pr > | t | 
0.4289 
0.0171 
0.0313 



SPSS 



Model 


Unstandardized 


Standardized 
Coefficients 




Sig. 


B 


Std. Error 


Beta 


1 (Constant) 


32.948 


39.242 




.840 


.429 


Bedrooms 


28.647 


9.215 


.587 


3.109 


.017 


Baths 


29.284 


10.904 


.507 


2.686 


.031 



a Dependent Variable: Price 



ANSWERS TO SUPPLEMENTARY PROBLEMS 



535 



STATISTIX 

Statistix 8 . 

Unweighted Least Squares Linear Regression of Price 
Predictor 

Variables Coefficients Std Error T P VIF 

Constant 32.9483 39.2425 0.84 0.4289 

Bedrooms 28.6466 9.21547 3.11 0.0171 1.1 

Baths 29.2845 10.9039 2.69 0.0313 1.1 

R-Squared 0.7700 

Estimated Price = 32.9 + 28.6(5) + 29.3(4) = 293.1 thousand. 

15.31 3.12. 

15.32 (a) 5.883; (b) 0.6882. 

15.33 0.9927. 

15.34 (a) 0.7567; (b) 0.7255; (c) 0.6810. 

15.37 The STATISTIX pull down "Statistics => Linear models => Partial Correlations" is used. The following 
dialog box is filled as shown. 




The following output results. 
Statistix 8 . 

Partial Correlations with XI 
Controlled for X3 

X2 0.5950 
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Similarly, it is found that: 
Statistix 8 . 

Partial Correlations with XI 
Controlled for X2 

X3 -0.8995 
and 

Statistix 8 . 

Partial Correlations with X2 
Controlled for XI 

X3 0.8727 
15.38 (a) 0.2672; (b) 0.5099; (c) 0.4026. 

15.42 (a) X 4 = 6X X + 3X 2 - 4X 3 - 100; (b) 54. 

15.43 (a) 0.8710; (b) 0.8587; (c) -0.8426. 

15.44 (a) 0.8947; (b) 2.680. 

15.45 Any of the following solutions will give the same coefficients as solving the normal equations. The pull-down 
"Tools =>■ Data analysis => Regression" is used in EXCEL to find the regression equation as well as other 
regression measurements. The part of the output from which the regression equation may be found is as 
follows. 



Coefficients 



Intercept 


-25.3355 


smoker 


-302.904 


Alcohol 


-4.57069 


Exercise 


-60.8839 


Dietary 


-36.8586 


Weight 


16.76998 


Age 


-9.52833 



In MINITAB the pull-down "Stat => Regression => Regression" may be used to find the regression 
equation. The part of the output from which the regression equation may be found is as follows. 



Regression Analysis: Medcost versus smoker, alcohol, . . . 

The regression equation is 

Medcost = — 25 — 303 smoker — 4.6 alcohol — 60.9 Exercise — 37 Dietary +16.8 weight 
- 9.53 Age 

Predictor Coef SE Coef T P 

Constant -25.3 644.8 -0.04 0.970 

smoker -302.9 256.1 -1.18 0.271 

alcohol -4.57 11.89 -0.38 0.711 

Exercise -60.88 19.75 -3.08 0.015 
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Dietary -36.9 104.0 -0.35 0.732 

weight 16.770 3.561 4.71 0.002 

Age -9.528 9.571 -1.00 0.349 



In SAS, the pull-down "Statistics => Regression => Linear" may be used to find the regression equation. 
The part of the output from which the regression equation may be found is as follows. 

The REG Procedure 
Model: MODEL 1 
Dependent Variable: Medcost Medcost 
Number of Observations Read 16 
Number of Observations Used 15 
Number of Observations with Missing Values 1 
Root MSE 224.41971 R-Square 0.9029 

Dependent Mean 2461.80000 Adj R-Sq 0.8301 

Coef f Var 9.11608 



Parameter Estimates 



Variable 


Label 


DF 


Parameter 


Standard 




Pr > | t | 








Estin 


late 






Intercept 


Intercept 


1 


-25 


33552 


644.84408 


-0.04 


0.9696 


smoker 


smoker 


1 


-302 


90395 


256.11003 


-1.18 


0.2709 


alcohol 


alcohol 


1 


-4 


57069 


11.88579 


-0.38 


0.7106 


Exercise 


Exercise 


1 


-60 


88386 


19.75371 


-3 .08 


0.0151 


Dietary 


Dietary 


1 


-36 


85858 


104 . 04736 


-0.35 


0.7323 


weight 


weight 


1 


16 


76998 


3 .56074 


4.71 


0.0015 






1 


-9 


52833 


9 .57104 


-1.00 


0.3486 



In SPSS, the pull-down "Analyze => Regression => Linear" may be used to find the regression equation. 
The part of the output from which the regression equation may be found is as follows. Look under the 
unstandardized coefficients. 



Coefficients 3 



Model 


Unstandardized 
Coefficients 


Standardized 
Coefficients 


t 


Sig. 


B 


Std. Error 


Beta 


1 (Constant) 


-25.336 


644.844 




-.039 


.970 


smoker 


-302.904 


256.110 


-.287 


-1.183 


.271 


alcohol 


-4.571 


11.886 


-.080 


-.385 


.711 


exercise 


-60.884 


19.754 


-.553 


-3.082 


.015 


dietary 


-36.859 


104.047 


-.044 


-.354 


.732 


weight 


-16.770 


3.561 


.927 


4.710 


.002 


age 


-9.528 


9.571 


-.124 


-.996 


.349 



^Dependent Variable: medcost 



In STATISTIX, the pull-down "Statistics => Linear models => Linear regression" may be used to find the 
regression equation. The part of the output from which the regression equation may be found is as follows. 
Look under the coefficients column. 
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Statistix 8 . 

Unweighted Least Squares Linear Regression of Medcost 



Predictor 
















Variables 


Coefficient 


Std Error 




T 




p 


VIF 


Constant 


-25.3355 


644 . 844 


-0. 


.04 





.9696 




Age 


9.52833 


9.57104 


-1. 


.00 





.3486 


1.3 


Dietary 


36.8586 


104 . 047 


-0. 


.35 





.7323 


1.3 


Exercise 


60.8839 


19.7537 


-3. 


.08 





.0151 


2.6 


alcohol 


-4.57069 


11.8858 


-0. 


.38 





.7106 


3.6 


smoker 


-302.904 


256.110 


-1. 


.18 





.2709 


4.9 


weight 


16.7700 


3 .56074 


4 


.71 





.0015 


3.2 



CHAPTER 16 

16.21 There is no significant difference in the five varieties at the 0.05 or 0.01 levels. The MINITAB analysis is 
as follows. 

One-way ANOVA: A, B, C, D, E 

Source DF SS MS F P 

Factor 4 27.2 6.8 0.65 0.638 

Error 15 157.8 10.5 
Total 19 185.0 

S = 3.243 R-Sq= 14.71% R-Sq (adj ) = . 00% 

Individual 95% CIs For Mean Based on 
Pooled StDev 

Level N Mean StDev + + + + 

A 4 16.500 3.697 ( * ) 

B 4 14.500 2.082 ( * ) 

C 4 17.750 3.862 ( * ) 

D 4 16.000 3.367 ( * ) 

E 4 17.500 2.887 ( * ) 

12.0 15.0 18.0 21.0 

16.22 There is no difference in the four types of tires at the 0.05 or 0.01 level. The STATISTIX analysis is as follows. 
Statistix 8 . 

Completely Randomized AOV for Mileage 

Source DF SS MS F P 

Type 3 77.500 25.8333 2.39 0.0992 

Error 20 216.333 10.8167 

Total 23 293.833 

Grand Mean 34.083 CV 9.65 

Chi-Sq DF P 

Bartlett ' s Test of Equal Variances 4.13 3 0.2476 

Cochran's Q 0.5177 
Largest Var / Smallest Var 6.4000 
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Component of variance for between groups 2 .50278 

Effective cell size 6.0 

Type Mean 

A 35.500 

B 36.000 

C 33.333 

D 31.500 

16.23 There is a difference in the three teaching methods at the 0.05 level, but not at the 0.01 level. The EXCEL 



analysis is a 


follows: 








Methodl 


Methodll 


Methodlll 




75 


81 




73 




62 


85 




79 




71 


68 




60 




58 


92 




75 




73 


90 




81 




Anova: Sing 


e Factor 








SUMMARY 










Groups 


Count 


Sum 


Average 


Variance 


Methodl 


5 


339 


67.8 


54.7 


Methodll 


5 


416 


83.2 


90.7 


Methodlll 


5 


368 


73.6 


67.8 



ANOVA 



Source of 
Variation 


SS 


df 


MS F P-value 


Between Groups 


604.9333 


2 


302.4667 4.256098 0.040088 


Within Groups 


852.8 


12 


71 .06667 


Total 


1457.733 


14 





16.24 There is a difference in the five brands at the 0.05 level but not at the 0.01 level. The SPSS analysis is follows. 
ANOVA 

mpg 





Sum of 
Squares 


df 


Mean Square 


F 


Sig. 


Between Groups 


52.621 


4 


13.155 


4.718 


.010 


Within Groups 


44.617 


16 


2.789 






Total 


97.238 


20 









16.25 There is a difference in the four courses at both levels. The SAS analysis is follows. 

The ANOVA Procedure 

Class Level Information 
Class Levels Values 

Subject 4 12 3 4 



Number of Observations Read 
Number of Observations Used 



16 
16 
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The ANOVA Procedure 



Dependent Variable : Grade 



Model 

Corrected Total 



Squares 
365.5708333 
198.8666667 
564.4375000 



Mean Square 
121.8569444 
16.5722222 



F Value 
7.35 



16.26 There is no difference in the operators or machines at the 0.05 level. The M1NITAB analysis is follows. 

Two-way ANOVA: Number versus Machine, Operator 

Source DF SS MS F P 



Total 8 88 

S = 2.550 R-Sq= 70.45% R-Sq ( adj ) = 40 . 91% 



16.27 There is no difference in the operators or machines at the 0.01 level. The EXCEL analysis is follows. 
Compare this EXCEL analysis and the M1NITAB analysis in the previous problem. 



Operatorl Operator2 Operator3 
machinel 23 27 24 
machine2 34 30 28 
machine3 28 25 27 



Anova: Two-Factor Without Replication 

SUMMARY Count Sum Average Variance 

Row 1 3 74 24.66667 4.333333 

Row 2 3 92 30.66667 9.333333 

Row 3 3 80 26.66667 2.333333 

Column 1 3 85 28.33333 30.33333 

Column 2 3 82 27.33333 6.333333 

Column 3 3 79 26.33333 4.333333 



ANOVA 
Source of 



Variation SS df_ MS F P-value 

Rows 56 2 28 4.307692 0.100535 

Columns 6 2 3 0.461538 0.660156 

Error 26 4 6.5 

Total 88 8 



16.28 The p- value is called s ig . in SPSS. The p-value for blocks is 0.640 and the p-value for the type of 

corn is 0.011, which is less than 0.05 and is therefore significant. There is no difference in the blocks at 
the 0.05 level. There are differences in the yield due to the type of corn at the 0.05 level. The SPSS analysis 
is as follows. 
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Tests of Between-Subjects Effects 
Dependent Variable: yield 



Source 


Type III Sum 
of Squares 


df 


Mean Square 


F 


Sig. 


Corrected Model 


77.600 3 


7 


11.086 


2.867 


.053 


Intercept 


3920.000 


1 


3920.000 


1013.793 


.000 


block 


10.000 


4 


2.500 


.647 


.640 


type 


67.600 


3 


22.533 


5.828 


.011 


Error 


46.400 


12 


3.867 






Total 


4044.000 


20 








Corrected Total 


124.000 


19 









a R Squared = .626 (Adjusted R Squared = 



16.29 At the 0.01 significance level, there is no difference in yield due to blocks o 
STATISTIX output with the SPSS output is Problem 16.28. 

Statistix 8 . 

Randomized Complete Block AOV Table for yield 

Source DF SS MS F P 



type of corn. Compare this 



10 . 000 
67 . 600 
46.400 
124 . 000 



block 
type 

Total 
Grand Mean 14.000 
Means of yield for type 
type Mean 

1 13.600 

2 17.000 

3 12.000 

4 13.400 



2 .5000 
22 .5333 
3.8667 



5.83 



16.30 SAS represents the ^-value as Pr > F. Referring to the last two lines of output, we see that both autos and 
tire brands are significant at the 0.05 level. 

The GLM Procedure 
Class Level Information 
Class Levels Values 

Auto 6 12 3 4 5 6 

Brand 4 12 3 4 

The GLM Procedure 
Dependent Variable : lifetime 

Sum of 



Source 
Model 
Error 

Corrected Total 



Squares 
201.3333333 
92.5000000 
293 . 8333333 



Mean Square 
25.1666667 
6.1666667 
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Source DF Type III SS Mean Square F Value Pr > F 

Auto 5 123.8333333 24.7666667 4.02 0.0164 

Brand 3 77.5000000 25.8333333 4.19 0.0243 



16.31 Compare the M1NITAB analysis in this problem with the SAS analysis in Problem 16.30. At the 0.01 
significance level, there is no difference in either auto or brand since the ^-values are 0.016 and 0.024 and 
both of these exceed 0.01. 

Two-way ANOVA: lifetime versus Auto, Brand 

Source DF SS MS F P 

Auto 5 123.833 24.7667 4.02 0.016 

Brand 3 77.500 25.8333 4.19 0.024 

Error 15 92.500 6.1667 

Total 23 293.833 

S= 2.483 R-Sq = 68.52% R-Sq ( adj ) = 51 . 73% 

16.32 The STATISTIX output is shown. The p-value of 0.3171 indicates no difference in schools at 0.05 level of 
significance. 

Statistix 8 . 

Randomized Complete Block AOV Table for Grade 

Source DF SS MS F P 

Method 2 604.93 302.467 

School 4 351.07 87.767 1.40 0.3171 

Error 8 501.73 62.717 

Total 14 1457.73 

The F value for teaching methods is 4.82 and the p-value for teaching methods is 0.0423. Therefore, there 
is a difference between teaching methods at the 0.05 significance level. 

16.33 The EXCEL output indicates that neither hair color nor heights of adult females have any bearing on 
scholastic achievement at significance level 0.05. The p-values for hair color is 0.4534 and the p-value for 
height is 0.2602. 

Redhead Blonde Brunette 

Tall 75 78 80 

Medium 81 76 79 

Short 73 75 77 

Anova: Two-Factor Without Replication 

SUMMARY Count Sum Average Variance 

Row 1 3 233 77.66667 6.333333 

Row 2 3 236 78.66667 6.333333 

Row 3 3 225 75 4 

Column 1 3 229 76.33333 17.33333 

Column 2 3 229 76.33333 2.333333 

Column 3 3 236 78.66667 2.333333 
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ANOVA 
Source of 

Variation SS df MS F P-value 



Rows 

Columns 

Error 



21 .55556 
10.88889 
22.44444 



2 
2 
4 



10.77778 
5.444444 
5.611111 



1 .920792 
0.970297 



0.260203 
0.453378 



Total 54.88889 8 



16.34 In the following SPSS output, the p-value is called S ig. The p-value for hair is 0.453 and the p-value for 
height is 0.260. These are the same p-values obtained in the EXCEL analysis in Problem 16.33. Since neither 
of these are less than 0.01, they are not different at level of significance 0.01. That is, the scores are not 
different for different hair colors nor are they different for different heights. 

Tests of Between- Subjects Effects 



Dependent Variable: Score 



Source 


Type III Sum 
of Squares 


df 


Mean Square 


F 


Sig. 


Corrected Model 


32.444 a 


4 


8.111 


1.446 


.365 


Intercept 


53515.111 


1 


53515.111 


9537.347 


.000 


Hair 


10.889 


2 


5.444 


.970 


.453 


Height 


21.556 


2 


10.778 


1.921 


.260 


Error 


22.444 


4 


5.611 






Total 


53570.000 


9 








Corrected Total 


54.889 


8 









a R Squared = .591 (Adjusted R Squared = .182) 



16.35 The MINITAB output shows that at the 0.05 level of significance there is a difference due to location but no 
difference due to fertilizers. There is significant interaction at the 0.05 level. 

ANOVA: yield versus location, fertilizer 



Factor 
location 



Type 
fixed 
fixed 



of Variance for yield 



location* fertili: 
Total 



SS 

81.225 
18 . 875 
78.275 
212 . 000 
390.375 



MS 
81 .225 

6.292 
26.092 

6.625 



12.26 0.001 
0.95 0.428 
3.94 0.017 



16.36 The STATISTIX output shows that at the 0.01 level of significance there is a difference due to location but 
no difference due to fertilizers. There is no significant interaction at the 0.01 level. 



544 



ANSWERS TO SUPPLEMENTARY PROBLEMS 



Statistix 8 . 

Analysis of Variance Table for yield 
Source DF SS 



fertilize 
location 

fertilize* location 
Total 



3 



18 . 875 
81.225 
78.275 
212 . 000 
390.375 



6.2917 
81.2250 
26.0917 

6.6250 



12 .26 
3.94 



0.4283 
0.0014 
0.0169 



16.37 The following SAS output gives the />-value for machines as 0.0664, the p-value for operators as 0.0004, and 
the /j-value for interaction as 0.8024. At the 0.05 significance level, only operators are significant. 
The GLM Procedure 
Class Level Information 
Class Levels Values 

Operator 4 12 3 4 

Machine 2 12 

Number of Observations Read 4 

Number of Observations Used 40 
The GLM Procedure 
Dependent Variable : Articles 



Source 


DF 


Squares 


Mean Square 


F Value 


Pr > F 


Model 


7 


154.8000000 


22 .1142857 


4.08 


. 0027 




32 


173 . 6000000 


5.4250000 






Corrected Tot 


39 


328 .4000000 










DF 


Type III SS 


Mean Square 


F Value 


Pr>F 


Machine 


1 


19 . 6000000 


19 . 6000000 


3 . 61 


.0664 


Operator 


3 


129.8000000 


43 .2666667 


7.98 


. 0004 


Operator *Machine 


3 


5.4000000 


1.8000000 


0.33 


0.8024 


The following MINITAB output is the same a: 


5 the SAS output. 







ANOVA: Articles versus Machine, Operator 

Factor Type Levels Values 



Machine 
Operator 

Analysis of Varie 



Source 
Operator 

Machine*Operator 
Total 



5 for Articles 



SS 

19 . 600 
129.800 
5.400 
173 . 600 
328 .400 



MS 
19 . 600 
43 .267 
1.800 
5.425 



0.066 
0.000 



16.38 The following STATISTIX output gives the p-values for soil vari, 

0.5658 and 0.3633 and the p- value for treatments as 0.6802. At the 0.01 significance level, 
are significant. 



two perpendicular directions as 
of the three 
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Statistix 8 . 

Latin Square AOV Table for Yield 

Source DF SS MS 

Row 3 17.500 5.8333 

Column 3 30.500 10.1667 

Treatment 3 12.500 4.1667 

Error 6 47.500 7.9167 

Total 15 108.000 



0.5658 
0.3633 
0.6802 



16.39 The following MINITAB output i; 
factors are significant at 0.05. 



the same as the STATISTIX output in Problem 16.38. None of the 



General Linear Model: Yield versus Row, Column, Treatment 

Levels Values 

4 1, 2, 3, 4 



Factor Type 

Row fixed 

Column fixed 4 1, 2, 3, 4 

Treatment fixed 4 1, 2, 3, 4 

Analysis of Variance for yield, using Adju 



;ted SS for Tests 



Source 


DF 


Seq SS 


Adj SS 


Adj MS 




F 




P 




3 


17 . 500 


17.500 


5 . 833 





74 





567 


Column 


3 


30 . 500 


30.500 


10.167 


1 


28 





362 


Treatment 


3 


12 . 500 


12 .500 


4.167 





53 





680 


Error 


6 


47 . 500 


47.500 


7.917 










Total 


15 


108.000 















16.40 The SPSS output shows no difference in scholastic achievements due to hair color, height, or birthplace at 
the 0.05 significance level. 

Tests of Between-Subjects Effects 



Dependent Variable: Score 



Source 


Type III Sum 
of Squares 


df 


Mean Square 


F 


Sig. 


Corrected Model 


44.000 a 


6 


7.333 


1.347 


.485 


Intercept 


53515.111 


1 


53515.111 


9829.306 


.000 


Hair 


10.889 


2 


5.444 


1.000 


.500 


Height 


21.556 


2 


10.778 


1.980 


.336 


Birthpl 


11.556 


2 


5.778 


1.061 


.485 


Error 


10.889 


2 


5.444 






Total 


53570.000 


9 








Corrected Total 


54.889 


8 









a R Squared = .802 (Adjusted R Squared = .206) 



16.41 The MINITAB analysis shows that there are significant differences in terms of the species of chicken and the 
quantities of the first chemical, but not in terms of the second chemical or of the chick's initial weights. Note 
that the p-value for species is 0.009 and the ^-value for chemical is 0.032. 
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General Linear Model: Wtgain versus Weight, Species, ... 



Factor 


Type 


Levels 


Value 










Weight 


fixed 


4 


1, 2, 


3, 4 








Species 


fixed 


4 


1, 2, 


3, 4 








Chemicall 


fixed 


4 


1, 2, 


3, 4 








Chemical2 


fixed 


4 


1, 2, 


3, 4 








Analysis o 


f varian 


ce for Wtga 


in, using Adjusted SS 


for Tests 




Source 


DF 


Seq SS 


Adj SS 


Adj MS 


F 




p 


Weight 


3 


2 .7500 


2 .7500 


0.9167 


2 .20 





267 


Species 


3 


38.2500 


38.2500 


12 .7500 


30.60 





009 


Chemicall 


3 


16.2500 


16.2500 


5.4167 


13.00 





032 


Chemical 2 


3 


7.2500 


7.2500 


2 .4167 


5 . 80 





091 




3 


1.2500 


1.2500 


.4167 








Total 


15 


65.7500 













16.42 The p- values are called S ig. in SPSS. There are significant differences in cable strength due to types of 
cable, but there are no significant differences due to operators, machines, or companies. 

Tests of Between-Subjects Effects 



Dependent Variable: Strength 



Source 


Type III Sum 
of Squares 


df 


Mean Square 


F 


Sig. 


Corrected Model 


6579.750 a 


12 


548.313 


5.489 


.094 


Intercept 


488251 .563 


1 


488251 .563 


4887.607 


.000 


Type 


4326.188 


3 


1442.063 


14.436 


.027 


Company 


2066.188 


3 


688.729 


6.894 


.074 


Operator 


120.688 


3 


40.229 


.403 


.763 


Machine 


66.688 


3 


22.229 


.223 


.876 


Error 


299.688 


3 


99.896 






Total 


495131.000 


16 








Corrected Total 


6879.438 


15 









a R Squared = .956 (Adjusted R Squared = .782) 



16.43 There is a significant difference in the three treatments at the 0.05 significance level, but not at the 0.01 level. 
The EXCEL analysis is given below. 

ABC 

3 4 6 
5 2 4 

4 3 5 
4 3 5 



Anova: Single Factor 
SUMMARY 



Groups 


Count 


Sum 


Averaqe 


Variance 


A 


4 


16 


4 


0.666667 


B 


4 


12 


3 


0.666667 


C 


4 


20 


5 


0.666667 
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Source of 
Variation 


SS 


df MS 


F 


P-value 


Between Groups 


8 


2 4 


6 


0.022085 


Within Groups 


6 


9 0.666667 






Total 


14 


11 







16.44 The MINITAB p-value is 0.700. There is no difference in the IQ due to height. 
One-way ANOVA: Tall, Short, Medium 

Source DF SS MS F P 



Factor 2 55.8 27.9 0.37 0.700 

Error 12 911.5 76.0 
Total 14 967.3 

S = 8.715 R-Sq=5.77% R-Sq (adj ) = . 00% 

Individual 95% CIs For Mean Based on 
Pooled StDev 

StDev + + + + -- 

10.58 ( * ) 

8.33 ( * ) 

7.15 ( * ) 

96.0 102.0 108.0 114.0 



Tall 
Short 
Medium 



107.00 
105.00 
102.50 



16.46 The p-value is called S ig. in the SPSS output. At the 0.05 level, there is 
examination scores due both to veteran status and to IQ. 

Tests of Between-Subjects Effects 

Dependent Variable: Score 



i significant difference ir 



Source 


Type III Sum 
of Squares 


df 


Mean Square 


F 


Siq. 


Corrected Model 


264.333 a 


3 


88.111 


176.222 


.006 


Intercept 


38080.667 


1 


38080.667 


76161.333 


.000 


Veteran 


24.000 


1 


24.000 


48.000 


.020 


IQ 


240.333 


2 


120.167 


240.333 


.004 


Error 


1.000 


2 


.500 






Total 


38346.000 


6 








Corrected Total 


265.333 


5 









a R Squared = .996 (Adjusted R Squared = .991) 

16.47 The STATISTIX analysis indicates that at the 0.01 level the difference in examination scores due to veteran 
status is not significant, but the difference due to the IQ is significant. 
Statistix 8 . 

Randomized Complete Block AOV Table for Score 

Source DF SS MS F P 

Veteran 1 24.000 24.000 48.00 0.020 

IQ 2 240.333 120.167 240.33 0.0041 
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Error 2 1.000 

Total 5 265.333 



16.48 The MINITAB analysis indicates that at the 0.05 level the differences in examination scores due to location 
are not significant, but the differences due to the IQ are significant. 

Two-way ANOVA: Testscore versus Location, IQ 

Source DF SS MS F P 

Location 3 6.250 2.083 0.12 0.943 

IQ 2 221.167 110.583 6.54 0.031 

Error 6 101.500 16.917 

Total 11 328.917 

16.49 The SAS analysis indicates that at the 0.01 level the differences in examination scores due to location 
are not significant, and the differences due to the IQ are not significant. Remember the p-value is written 
as Pr >F . 

The GLM Procedure 
Class Level Information 
Class Levels Values 

Location 4 12 3 4 

IQ 3 12 3 

Number of Observations Read 12 
Number of Observations Used 12 

The GLM Procedure 

Source DF Type III SS Mean Square F Value Pr>F 

Location 3 6.2500000 2.0833333 0.12 0.9430 

IQ 2 221.1666667 110.5833333 6.54 0.0311 



16.53 The MINITAB analysis indicates that at the 0.05 level the differences in rust scores due to location are not 
significant, but the differences due to the chemicals are significant. There is no significant interaction between 
location and chemicals. 



ANOVA: rust versus location, chemical 

Factor Type Levels Values 

location fixed 2 1, 2 

chemical fixed 3 1, 2, 3 

Analysis of Variance for rust 

Source DF SS MS F P 

location 1 0.667 0.667 0.67 0.425 

chemical 2 20.333 10.167 10.17 0.001 

location*chemical 2 0.333 0.167 0.17 0.848 

Error 18 18.000 1.000 

Total 23 39.333 
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16.54 The STATISTIX analysis indicates that at the 0.05 level the differences in yields due to location are 



significant, but the differences due to varieties a 
location and v 



Statistix 8 . 

Analysis of Variance Table for Yield 
Source 

Variety- 
location 

Variety* location 
Error 



>t significant. There is no significant interaction between 



Total 



59 



SS 

35.333 
191.433 
82.567 
371.250 
680.583 



8.8333 
95.7167 
10.3208 

8.2500 



1.07 
11.60 



0.3822 
0.0001 
0.2928 



16.55 The MINITAB analysis indicates that at the 0.01 level the differences in yields due to location are 
significant, but the differences due to varieties are not significant. There is no significant interaction 
between location and varieties. 

General Linear Model: Yield versus location, Variety- 
Factor Type Levels Values 
location fixed 3 1, 2, 3 



variety 


fixed 


5 


1, 2, 


3, 4, 5 










Analysis 


of Variance for 


Yield, 


using Adjusted SS 


for Tests 










DF 


Se 


q SS 


Adj SS 


Adj MS 


F 




p 




2 


191 


433 


191.433 


95.717 


11.60 





000 




4 


35 


333 


35.333 


8 . 833 


1.07 





382 




*Variety 8 


82 


567 


82 .567 


10.321 


1.25 





293 




45 


371 


250 


371.250 


8.250 








Total 


59 


680 


583 













16.56 Referring to the SPSS ANOVA, and realizing that S ig. is the same as the />- value in SPSS, we see that 
Factorl, Factor2, and Treatment do not have significant effects on the response variable at 
the 0.05 level, since the p-values are greater than 0.05 for all three. 

Tests of Between-Subjects Effects 



Dependent Variable: Response 



Source 


Type III Sum 
of Squares 


df 


Mean Square 


F 


Sig. 


Corrected Model 


92.667 a 


6 


15.444 


7.316 


.125 


Intercept 


2567.111 


1 


2567.111 


1216.000 


.001 


Factorl 


74.889 


2 


37.444 


17.737 


.053 


Factor2 


17.556 


2 


8.778 


4.158 


.194 


Treatment 


.222 


2 


.111 


.053 


.950 


Error 


4.222 


2 


2.111 






Total 


2664.000 


9 








Corrected Total 


96.889 


8 









a R Squared = .956 (Adjusted R Squared = .826) 
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16.58 Referring to the SPSS ANOVA, and realizing that S ig. is the same as the p- value in SPSS, we see that 
Factorl, Factor2 , Latin Treatment and Gr eek Tr eatment do not have significant 
effects on the response variable at the 0.05 level, since the /j-values are greater than 0.05 for all four. 

Tests of Between-Subjeets Effects 



Dependent Variable: Response 



Source 


Type III Sum 
of Squares 


df 


Mean Square 


F 


Sig. 


Corrected Model 


362.750 3 


12 


30.229 


.924 


.607 


Intercept 


1914.063 


1 


1914.063 


58.482 


.005 


Factorl 


5.188 


3 


1.729 


.053 


.981 


Factor2 


15.188 


3 


5.063 


.155 


.920 


Latin 


108.188 


3 


36.063 


1.102 


.469 


Greek 


234.188 


3 


78.063 


2.385 


.247 


Error 


98.188 


3 


32.729 






Total 


2375.000 


16 








Corrected Total 


460.938 


15 









a R Squared = .787 (Adjusted R Squared = -.065) 



CHAPTER 17 

17.26 For a two-tailed alternative, the Rvalue is 0.0352. There is a difference due to the additive at the 0.05 level, 
but not at the 0.01 level. The /rvalue is obtained using the binomial distribution and not the normal 
approximation to the binomial. 

17.27 For a one-tailed alternative, the />-value is 0.0176. Since this p-value is less than 0.05, reject the null 
hypothesis that there is no difference due to the additive. 

17.28 The rvalue using EXCEL, is given by =1 - B INOMDIST ( 24 , 3 1 , . 5 , 1 ) or 0.00044. 
The program is effective at the 0.05 level of significance. 

17.29 The rvalue using EXCEL, is given by =1 - BINOMDIST (15,22,0.5,1) or 0.0262. 
The program is effective at the 0.05 level. 

17.30 The p-value using EXCEL, is given by =1 - BINOMDIST ( 16 , 2 5 , . 5 , 1 ) or 0.0539. 
We cannot conclude that brand B is preferred over brand A at the 0.05 level. 

17.31 BS = 25 BS = 30 BS = 35 BS = 40 
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24 -1 -6 -11 -16 

0.00154388 2*BINOMDIST (4, 24, . 5, 1) Reject null 

0.307456255 2 *BINOMDIST ( 9 , 24 , . 5 , 1 ) Do not reject null 

0.541256189 2 *BINOMDIST ( 10 , 24 , . 5 , 1 ) Do not re j ect null 

0.004077315 2 *BINOMDIST ( 5 , 2 5 , . 5 , 1 ) Reject null 

17.34 Sum of ranks of the smaller sample= 141.5 and the sum of ranks for the larger sample= 158.5. Two-tailed 
p-value = 0.3488. Do not reject the null hypothesis of no difference at 0.05 level, since /rvalue > 0.05. 

17.35 Cannot reject the one-sided null hypothesis in Problem 17.34 at the 0.01 level. 

17.36 Sum of ranks of the smaller sample = 132.5 and the sum of ranks for the larger sample = 77.5. Two-tailed 
/>-value = 0.0044. Reject the null hypothesis of no difference at both the 0.01 and the 0.05 level, 

since /?-value<0.05. 

17.37 The farmer of Problem 17.36 can conclude that wheat II produces a larger yield than wheat I at the 0.05 

17.38 Sum of ranks for brand ^4 = 86.0 and the sum of ranks for brand B= 124.0. Two-tailed p-\a\ue = 0.1620. 
(a) Do not reject the null hypothesis of no difference between the two brands versus there is a difference at 
the 0.05 level, since ^-value>0.05. (b) Cannot conclude that brand B is better than brand A at the 0.05 level 
since the one-tailed p- value (0.08 1)> 0.05. 

17.39 Yes, the U test as well as the sign test may be used to determine if there is a difference between the two 
machines. 

17.41 3. 

17.42 6. 

17.46 (a) 246; (b) 168; (c) 0. 

17.47 (a) 236; (b) 115; (c) 100. 

17.49 H= 2.59, DF= 4, P = 0.629. There is no significant difference in yields of the five varieties at the 0.05 or 
the 0.01 level since the p-value is greater than 0.01 and 0.05. 

17.50 H= 8.42, DF= 3, P= 0.038. There is a significant difference in the four brands of tires at the 0.05 level, 
but not at the 0.01 level since the p-value is such that 0.01 <_p-value< 0.05. 
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17.51 77 = 6.54, DF = 2, P = 0.038. There is a significant difference in the three teaching methods at the 0.05 level, 
but not at the 0.01 level since the />value is such that 0.01 <p-value<0.05. 

17.52 77 = 9.22, DF= 3, P = 0.026. There is a significant difference in the four subjects at the 0.05 level, but not at 
the 0.01 level since the p-value is such that 0.01 </>-value<0.05. 

17.53 (a) 77 = 7.88, DF= 8, P = 0.446. There is no significant differences in the three TV tube lifetimes at the 0.01 

or the 0.05 levels since the /(-value > 0.01 and 0.05. 

(b) H=2.59, DF= 4, P= 0.629. There is no significant differences in the five varieties of wheat at the 0.01 or 
the 0.05 levels since the Rvalue > 0.01 and 0.05. 

(c) 77 = 5.70, DF= 3, P = 0.127. There is no difference in the four brands of tires at either the 0.01 or the 0.05 
levels since the /)-value>0.01 as well as > 0.05. 

17.54 (a) #=5.65, DF=2, P = 0.059. There is no difference in the three methods of teaching at either the 0.01 or 

the 0.05 levels since the />-value>0.01 as well as > 0.05. 

(b) 77= 10.25, DF=4, P = 0.036. There is a difference in the five brands of gasoline at the 0.05 level, but 
not at the 0.01 level since 0.01 < p-value<0.05. 

(c) 77=9.22, DF=3, P= 0.026. There is a difference in the four subjects at the 0.05 level, but not at the 
0.01 level since 0.01 </>-value<0.05. 

17.55 (a) 8; (b) 10. 

17.56 (a) The number of runs V= 10. 

(b) The randomness test is based on the standard normal. The mean is 



and the standard deviation is 



2(11)(14){2(11)(14)-U-14} 
° V = V 25^(24) = 1M 

The computed Z is 

Using EXCEL the p-value is= 2*NORMSDIST ( -1 . 38 ) or 0.1676. Due to the large /7-value, 
we do not doubt randomness. 



17.57 (a) Even though the number of runs is below what we expect, the p-value is not less than 0.05. We do not 
reject randomness of sequence (70). 

Runs Test: coin 

Runs test for coin 

Runs above and below K= . 4 

The observed number of runs = 7 

The expected number of runs = 10.6 

8 observations above K, 12 below 

* N is small, so the following approximation may be invalid. 
P-value = 0.084 

(b) The number of runs is above what we expect. We reject randomness of the sequence (77). 
Runs Test: coin 



Runs test for coin 

Runs above and below K = . 5 
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The observed number of runs = 12 
The expected number of runs = 7 
6 observations above K, 6 below 

* N is small, so the following approximation may be invalid. 
P-value = .002 



Sampling distribute 



Probability distribu 
V Pr{V] 

2 0.667 

3 0.333 

17.59 Mean = 2.333 Variance = 0.222 



Sampling distribute 



Probability distribute 
V Pr { V] 

2 0.333 



Mean V 
Variano 
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Sampling distribution 



Probability distribution 
V Pr { V} 

2 0.25 

3 0.75 

Mean V 2.75 
Variance V . 188 



Sampling distribution 



Probability distributic 
V Pr { V} 



Mean V 2.6 
Variance V 0.24 
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(b) 

Sampling distribution 

V f 

2 2 

3 4 

4 6 

5 3 

(c) 

Probability distribution 

V Pr { V] 

2 0.133 

3 0.267 

4 0.4 

5 0.2 

Mean V 3.667 
Variance V . 888 

17.62 Assume the rows are read in one at a time. That is row 1 is followed by row 2, is followed by row 3, and 
finally row 4. 

Runs Test: Grade 

Runs test for Grade 

Runs above and below K= 69 

The observed number of runs = 26 

The expected number of runs = 20.95 

21 observations above K, 19 below 

P-value= 0.105 

The grades may be assumed to have been recorded randomly at the 0.05 level. 

17.63 Assume the data is recorded row by row. 
Runs test for price 

Runs above and below K= 11 . 3 6 
The observed number of runs = 10 
The expected number of runs = 13.32 
14 observations above K, 11 below 
P-value= .168 

The prices may be assumed random at the 0.05 level. 

17.64 In the digits following the decimal, let represent an even digit and 1 represent and odd digit. 

Runs test for digit 

Runs above and below K= .473684 

The observed number of runs = 9 

The expected number of runs = 10 . 4737 

9 observations above K, 10 below 

* N is small, so the following approximation may be invalid. 
P-value= .485 

The digits may be assumed to be random at the 0.05 level. 



17.65 The digits may be assumed random at the 0.05 level. 
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17.66 Using the normal approximation, the computed Z value is 1.62. The computed ^-value using EXCEL 
is=2*NORMSDIST(-1.62) or 0.105. 

17.67 Using the normal approximation, the computed Z value is -1.38. The computed p- value using EXCEL 
is=2*NORMSDIST(-1.38) or 0.168. 

17.68 Using the normal approximation, the computed Z value is -1.38. The computed p- value using EXCEL 
is=2*NORMSDIST(-0.70) or 0.484. 

17.70 Spearman rank correlation = 1.0 and Pearson correlation coefficient = 0.998. 
CHAPTER 18 

18.16 Subgroup means: 13.25 14.50 17.25 14.50 13.50 14.75 13.75 15.00 15.00 17.00 
Subgroup ranges: 595689 10557 

X = 14.85, i? = 6.9. 

18.17 The pooled estimate of a is 1.741. LCL = 450.7, UCL = 455.9. None of the subgroup means is outside the 
control limits. 



18.19 The plot indicates that variability has been reduced. The new control limits are LCL = 452.9 and 
UCL = 455.2. It also appears that the process is centered closer to the target after the modification. 

18.20 The control limits are LCL = 1.980 and UCL = 2.017. Periods 4, 5, and 6 fail test 5. Periods 15 through 20 
fail test 4. Each of these periods is the end of 14 points in a row, alternating up and down. 

18.21 C PK = 0.63. ppm non-conforming = 32,487. 

18.22 C PK = 1.72. ppm non-conforming = less than 1. 

18.23 Centerline = 0.006133, LCL = 0, UCL = 0.01661. Process is in control. ppm = 6,133. 

18.24 Centerline = 3.067, LCL = 0, UCL = 8.304. 

18.25 0.032 0.027 0.032 0.024 0.024 0.027 0.032 0.032 0.027 0.024 0.032 0.024 
0.027 0.024 0.027 0.024 0.027 0.027 0.027 0.027 

18.26 Centerline = X = 349.9. 

Moving ranges: 0.0 0.2 0.6 0.8 0.4 0.3 0.1 0.4 0.4 0.9 0.2 1.1 0.5 1.5 2.8 1.6 0.3 0.1 1.2 1.9 0.2 
1.2 0.9 

Mean of the moving ranges given above = R M = 0.765. 

Individuals chart control limits: X±3(R M /d 2 ). d 2 is a control chart constant that is available from 
tables in many different sources and in this case is equal to 1.128. LCL = 347.9 and UCL = 352.0. 

18.27 The EWMA chart indicates that the process means are consistently below the target value. The means for 
subgroups 12 and 13 drop below the lower control limits. The subgroup means beyond 13 are above the 
lower control limit; however, the process mean is still consistently below the target value. 

18.28 A zone chart does not indicate any out of control conditions. However, as seen in Problem 19.20, there are 
14 points in a row alternating up and down. Because of the way the zone chart operates, it will not indicate 
this condition. 
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18.29 The 20 lower control limits are: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.38 0.00 0.00 0.00 0.00 0.00 0.38 
0.00 0.00 038 0.00 0.38 0.38. 

The 20 upper control limits are: 9.52 9.52 9.52 9.52 9.52 7.82 8.46 7.07 7.82 8.46 9.52 9.52 9.52 
7.07 9.52 9.52 7.07 9.52 7.07 7.07. 



18.30 Discoloration; discoloration and loose straps. 
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Appendix I 



Ordinates(Y) 
of the 
Standard 
Normal Curve 
atz 




.3989 
.3970 
.3910 
.3814 
.3683 

.3521 
.3332 
.3123 
.2897 
.2661 

.2420 
.2179 
.1942 
.1714 
.1497 



.0940 
.0790 
.0656 

.0540 
.0440 
.0355 
.0283 
.0224 

.0175 
.0136 
.0104 
.0079 
.0060 

.0044 
.0033 
.0024 
.0017 
.0012 

.0009 
.0006 
.0004 
.0003 
.0002 



.3989 
.3965 



.3503 
.3312 
.3101 
.2874 
.2637 

.2396 
.2155 
.1919 



.1276 
.1092 
.0925 
.0775 
.0644 

.0529 
.0431 
.0347 
.0277 
.0219 



.0101 
.0077 
.0058 

.0043 
.0032 
.0023 
.0017 
.0012 

.0008 
.0006 
.0004 
.0003 
.0002 



.3790 
.3653 

.3485 
.3292 
.3079 
.2850 
.2613 

.2371 
.2131 



.0909 
.0761 
.0632 

.0519 
.0422 
.0339 
.0270 
.0213 

.0167 
.0129 
.0099 
.0075 
.0056 

.0042 
.0031 
.0022 
.0016 
.0012 

.ooos 

.0006 
.0004 
.0003 
.0002 



.3956 
.3885 
.3778 
.3637 



.3056 
.2827 
.2589 

.2347 
.2107 
.1872 
.1647 
.1435 

.1238 
.1057 
.0893 
.0748 
.0620 

.0508 
.0413 
.0332 
.0264 
.0208 

.0163 
.0126 
.0096 
.0073 
.0055 

.0040 
.0030 
.0022 
.0016 
.0011 

.0008 
.0005 
.0004 
.0003 
.0002 



.3876 
.3765 
.3621 

.3448 
.3251 
.3034 
.2803 
.2565 

.2323 
.2083 
.1849 
.1626 
.1415 



.0404 
.0325 
.0258 
.0203 



.0053 

.0039 
.0029 
.0021 
.0015 
.0011 

.0008 
.0005 
.0004 
.0003 
.0002 



.3945 
.3867 
.3752 
.3605 

.3429 
.3230 
.3011 
.2780 
.2541 

.2299 
.2059 
.1826 
.1604 
.1394 



.0863 
.0721 
.0596 

.0488 
.0396 
.0317 
.0252 



.0069 
.0051 

.0038 
.0028 
.0020 
.0015 
.0010 

.0007 
.0005 
.0004 
.0002 
.0002 



.3939 
.3857 
.3739 
.3589 

.3410 
.3209 
.2989 
.2756 
.2516 

.2275 
.2036 
.1804 
.1582 
.1374 



.0848 
.0707 
.0584 

.0478 
.0387 
.0310 
.0246 
.0194 



.0088 
.0067 
.0050 

.0037 
.0027 
.0020 
.0014 
.0010 

.0007 
.0005 
.0003 



.3980 
.3932 
.3847 
.3725 
.3572 

.3391 
.3187 
.2966 
.2732 
.2492 

.2251 
.2012 



.0833 
.0694 
.0573 

.0468 
.0379 
.0303 
.0241 



.0086 
.0065 
.0048 

.0036 
.0026 
.0019 
.0014 
.0010 

.0007 
.0005 
.0003 
.0002 
.0002 
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Areas 



Normal Curve 
from to z 




3413 

3643 
3849 
.4032 
.4192 

.4332 
.4452 
.4554 
.4641 
.4713 

.4772 
.4821 
.4861 
.4893 
.4918 

.4938 
.4953 
.4965 
.4974 
.4981 



.4993 
.4995 
.4997 

.4998 
.4998 
.4999 
.4999 
5000 



.0040 
.0438 
.0832 



.4463 
.4564 
.4649 
.4719 

.4778 
.4826 
.4864 
.4896 
.4920 

.4940 
.4955 
.4966 
.4975 



.4991 
.4993 
.4995 
.4997 



.4999 
.4999 
5000 



.2939 
.3212 

.3461 

.3686 
.3888 
.4066 
.4222 

.4357 
.4474 
.4573 
.4656 
.4726 



.4941 
.4956 
.4967 
.4976 
.4982 

.4987 
.4991 
.4994 
.4995 
.4997 

.4998 
.4999 
.4999 
.4999 
.5000 



.0120 
.0517 
.0910 



2357 
2673 
2967 
3238 

3485 
3708 
3907 
.4082 
.4236 

.4370 
.4484 
.4582 
.4664 
.4732 

.4788 
.4834 
.4871 



.4943 
.4957 
.4968 
.4977 
.4983 

.4988 
.4991 
.4994 
.4996 
.4997 

.4998 
.4999 
.4999 
.4999 
5000 



.0160 

.0557 
.0948 



2054 

2389 



.3508 
.3729 
.3925 
.4099 
.4251 



.4591 
.4671 
.4738 

.4793 
.4838 
.4875 
.4904 
.4927 

.4945 
.4959 
.4969 
.4977 



.4992 
.4994 
.4996 
.4997 

.4998 
.4999 
.4999 
.4999 
.5000 



.0199 
.0596 
.0987 
1368 
1736 

.2088 
.2422 
2734 
3023 
3289 

.3531 
.3749 
3944 
.4115 
.4265 



.4599 
.4678 
.4744 

.4798 
.4842 
.4878 
.4906 
.4929 

.4946 
.4960 
.4970 
.4978 



.4992 
.4994 
.4996 
.4997 

.4998 
.4999 
.4999 
.4999 
5000 



,0239 
.0636 
1026 
1406 

1772 



3554 
3770 
.3962 
.4131 
.4279 

.4406 
.4515 
.4608 
.4686 
.4750 



.0279 
.0675 
1064 



2157 
2486 
2794 
3078 
3340 

3577 
.3790 
.3980 
.4147 
.4292 

.4418 
.4525 
.4616 
.4693 
.4756 

.4808 
.4850 
.4884 



.4949 
.4962 



.4992 
.4994 
.4996 
.4997 

.4998 
.4999 
.4999 
.4999 
5000 
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Appendix III 



Percentile Values (f p ) 
for 

Student's f Distribution 
with v Degrees of Freedom 
(shaded area = p) 
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Appendix IV 



Percentile Values {&) 
for 

the Chi-Square Distribution 
with v Degrees of Freedom 
(shaded area = p) 



6.63 5.02 3.84 2.71 
9.21 7.38 5.99 4.61 
11.3 9.35 7.81 6.25 



.0010 
.0506 



.0002 
.0201 



.0000 
.0100 



16.7 15.1 12.8 
18.5 16.8 14.4 



22.0 20.1 17.5 



25.2 23.2 20.5 
26.8 24.7 21.9 



28.3 26.2 23.3 



29.8 27.7 24.7 22.4 19.8 
31.3 29.1 26.1 23.7 21.1 



32.8 30.6 27.5 25.0 22.3 



35.7 33.4 30.2 27.6 24.8 
37.2 34.8 31.5 28.9 26.0 



40.0 37.6 34.2 



7.84 
9.04 



7.26 
7.96 



46.9 44.3 40.6 37.7 34.4 

48.3 45.6 41.9 38.9 35.6 

49.6 47.0 43.2 40.1 36.7 
51.0 48.3 44.5 41.3 37.9 
52.3 49.6 45.7 42.6 39.1 

53.7 50.9 47.0 43.8 40.3 

66.8 63.7 59.3 55.8 51.8 
79.5 76.2 71.4 67.5 63.2 
92.0 88.4 



83.3 



79.1 



74.4 



2 100.4 95.0 90.5 85.5 

116.3 112.3 106.6 101.9 96.6 

128.3 124.1 118.1 113.1 107.6 

2 135.8 129.6 124.3 118.5 
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Appendix V 



95th Percentile Values 
for the F Distribution 
(v { degrees of freedom in numerator) 
(v 2 degrees of freedom in denominator) 



19.0 19.2 

9.55 9.28 

6.94 6.59 

5.79 5.41 



4.46 4.07 
4.26 3.86 



4.96 4.10 3.71 

4.84 3.98 3.59 

4.75 3.89 3.49 

4.67 3.81 3.41 

4.60 3.74 3.34 

4.54 3.68 3.29 

4.49 3.63 3.24 

4.45 3.59 3.20 

4.41 3.55 3.16 

4.38 3.52 3.13 

4.35 3.49 3.10 

4.32 3.47 3.07 

4.30 3.44 3.05 

4.28 3.42 3.03 

4.26 3.40 3.01 

4.24 3.39 2.99 



19.5 19.5 
8.62 8.59 



4.50 4.46 
3.81 3.77 



2.86 2.83 
2.70 2.66 



2.25 2.20 
2.19 2.15 



.84 1.79 
.74 1.69 



Source: E. S. Pearson and H. O. Hartley, Biometrika Tables for Statisticians. Vol. 2 (1972). Table 5, page 178, by permission. 
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Appendix VI 



99th Percentile Values 
for the F Distribution 
(v 1 degrees of freedom in numerator) 
(v 2 degrees of freedom in denominator) 



20 24 30 



Source: E. S. Pearson and H. O. H irllc; »/«/;/. /. / J." ' 1 /../ Slaiisiickms. Vol 2 (1972), Table 5, page 180, by permission. 



566 

Copyright © 2008, 1999, 1988, 1961 by The McGraw-Hill Companies, Inc. Click here for terms of use. 



Appendix VII 



Four-Place Common Logarithms 



12 3 4 

0000 0043 0086 0128 0170 

0414 0453 0492 0531 0569 

0792 0828 0864 0899 0934 

9 1173 1206 1239 1271 

1461 1492 1523 1553 1584 



1761 



1790 1818 1847 1875 

2068 2095 2122 2148 

2330 2355 2380 2405 

2577 2601 2625 2648 

2788 2810 2833 2856 2878 

3010 3032 3054 3075 3096 

3243 3263 3284 3304 

3424 3444 3464 3483 3502 

3617 3636 3655 3674 3692 

3820 3838 3856 3874 

3979 3997 4014 4031 4048 

4150 4166 4183 4200 4216 

4314 4330 4346 4362 4378 

4472 4487 4502 4518 4533 

4624 4639 4654 4669 4683 

4771 4786 4800 4814 4829 

4914 4928 4942 4955 4969 

5051 5065 5079 5092 5105 

5185 5198 5211 5224 5237 

5315 5328 5340 5353 5366 

5453 5465 5478 5490 

5563 5575 5587 5599 5611 

5682 5694 5705 5717 5729 

5798 5809 5821 5832 5843 

5922 5933 5944 5955 



5 6 7 8 9 

0212 0253 0294 0334 0374 

0607 0645 0682 0719 0755 

0969 1004 1038 1072 1106 

1303 1335 1367 1399 1430 

1614 1644 1673 1703 1732 

1903 1931 1959 1987 2014 

2175 2201 2227 2253 2279 

2430 2455 2480 2504 2529 

2672 2695 2718 2742 2765 

2900 2923 2945 2967 2989 



Proportional Parts 
5 6 7 



MIS 



3160 3181 3201 



3522 



3345 3365 3385 3404 

3541 3560 3579 3598 

3711 3729 3747 3766 3784 

3892 3909 3927 3945 3962 

4065 4082 4099 4116 4133 

4232 4249 4265 4281 4298 

4393 4409 4425 4440 4456 

4548 4564 4579 4594 4609 

4698 4713 4728 4742 4757 



4900 



4843 4857 4871 

4983 4997 5011 5024 5038 

5119 5132 5145 5159 5172 

5250 5263 5276 5289 5302 

5378 5391 5403 5416 5428 

5502 5514 5527 5539 5551 

5623 5635 5647 5658 5670 

5740 5752 5763 5775 5786 

5855 5866 5877 5888 5899 

5966 5977 5988 5999 6010 



1 6031 

6128 6138 

6232 6243 

6335 6345 

6435 6444 

6532 6542 

6628 6637 

6721 6730 

6812 6821 

6902 6911 

6990 6998 7007 7016 7024 

7076 7084 7093 7101 7110 

7160 7168 7177 7185 7193 

7243 7251 7259 7267 7275 

7324 7332 7340 7348 7356 



6042 6053 6064 

6149 6160 6170 

6253 6263 6274 

6355 6365 6375 

6454 6464 6474 

6551 6561 6571 

6646 6656 6665 

6739 6749 6758 

6830 6839 6848 

6920 6928 6937 



6117 

6222 
6325 
6425 
6522 







2 



6075 6085 6096 

6180 6191 6201 

6284 6294 6304 

6385 6395 6405 

6484 6493 6503 

6580 6590 6599 

6675 6684 6693 

6767 6776 6785 

6857 6866 6875 

6946 6955 6964 

7033 7042 7050 7059 7067 

7118 7126 7135 7143 7152 

7202 7210 7218 7226 7235 

7284 7292 7300 

7364 7372 7380 

5 6 



6107 
6212 
6314 
6415 
6513 
6609 
6702 
6794 



6972 6981 



7308 7316 
7388 7396 



1 2 3 

4 8 12 17 21 25 29 33 37 

4 8 11 15 19 23 26 30 34 

3 7 10 14 17 21 24 28 31 

23 26 29 

21 24 27 



3 6 



17 20 



25 
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Four-Place Common Logarithms (continued) 



o 



5 



Proportional Parts 



7404 7412 7419 7427 7435 

7482 7490 7497 7505 7513 

7559 7566 7574 7582 7589 

7634 7642 7649 7657 7664 

7709 7716 7723 7731 7738 



7443 7451 7459 7466 7474 

7520 7528 7536 7543 7551 

7597 7604 7612 7619 7627 

7672 7679 7686 7694 7701 

7745 7752 7760 7767 7774 



7782 7789 7796 7803 7810 

7853 7860 7868 7875 7882 

7924 7931 7938 7945 7952 

7993 8000 8007 8014 8021 

8062 8069 8075 8082 8089 



7818 7825 7832 7839 7846 

7889 7896 7903 7910 7917 

7959 7966 7973 7980 7987 

8028 8035 8041 

8096 8102 8109 



8048 8055 
S116 8122 



8129 8136 8142 8149 8156 

8195 8202 8209 8215 8222 

1 8267 8274 8280 8287 

8325 8331 8338 8344 8351 

8 8395 8401 8407 8414 



8162 8169 8176 8182 

8228 8235 8241 8248 

8293 8299 8306 8312 

8357 8363 8370 8376 

8420 8426 8432 8439 



1 8457 8463 8470 8476 

8513 8519 8525 8531 8537 

8573 8579 8585 8591 8597 

8633 8639 8645 8651 8657 

8692 8698 8704 8710 8716 



8494 8500 8506 



8543 8549 8555 

8603 8609 8615 

8663 8669 8675 

8722 8727 8733 



8561 8567 

8621 8627 

8681 8686 

8739 8745 



8797 8802 



2 2 3 3 



3 8899 8904 8910 8915 
9 8954 8960 8965 8971 
9004 9009 9015 9020 9025 



2 2 3 3 
2 2 3 3 



9196 9201 9206 9212 
9248 9253 9258 9263 



9058 9063 9069 9074 9079 

?1 12 9117 9122 9128 9133 

9165 9170 9175 

9217 9222 9227 

9269 9274 9279 



9180 9186 
9238 
9284 9289 



4 9299 9304 9309 9315 

5 9350 9355 9360 9365 
9395 9400 9405 9410 9415 
9445 9450 9455 9460 9465 
9494 9499 9504 9509 9513 



9320 9325 9330 9335 9340 

9370 9375 9380 9385 9390 

9420 9425 9430 9435 9440 

9469 9474 9479 9484 9489 

9518 9523 9528 9533 9538 



2 2 3 3 



2 9547 9552 9557 9562 

9595 9600 9605 9609 

9638 9643 9647 9652 9657 

9685 9689 9694 9699 9703 

9731 9736 9741 9745 9750 



9566 9571 9576 

9614 9619 9624 

9661 9666 9671 

9708 9713 9717 

9754 9759 9763 



9581 9586 

9628 9633 

9675 9680 

9722 9727 

9768 9773 



9777 9782 9786 9791 9795 

9823 9827 9832 9836 9841 

8 9872 9877 9881 9886 

9912 9917 9921 9926 9930 

9956 9961 9965 9969 9974 







3 



9845 9850 9854 



9859 9863 

9903 9908 

9948 9952 

9991 9996 



2 2 3 3 
2 2 3 3 
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Appendix VIII 



Values of e 

(0<A < 1) 



1.0000 
.9048 



.7408 
.6703 



.7334 
.6636 



.8025 
.7261 
.6370 



.7945 
.7189 
.6505 



.7118 
.6440 



.9512 
.8607 



.7047 
.6376 



.9418 
.8521 



.6977 
.6313 



.9324 
.8437 
.7634 
.6907 
.6250 



.6065 
.5488 
.4966 
.4493 
.4066 



.6005 
.5434 
.4916 
.4449 
.4025 



.5945 
.5379 



.5326 
.4819 
.4360 
.3946 



.5827 
.5273 
.4771 
.4317 



.5770 
.5220 
.4724 
.4274 
.3867 



.5712 
.5169 

.4677 
.4232 
.3829 



.5655 
.5117 
.4630 
.4190 
.3791 



,10) 



e- x .36788 .13534 .04979 .01832 .006738 .002479 

Note: To obtain values of e x for other values of A, use the laws of exponents. 
Example: e~ 3 48 = (<r 300 )(e- 48 ) = (0.04979)(0.6188) = 0.03081. 
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Appendix IX 



Random Numbers 



51772 
24033 
45939 



64937 
15630 
09448 
21631 
91097 

50532 
07136 
27989 
85184 
54398 

65544 
08263 
39817 
62257 
53298 



74640 
23491 
60173 
02133 
79353 

03355 
64759 
56301 
91157 



64728 
73949 
21154 

34371 
65952 
67906 
04077 
90276 



42331 
S35K7 
52078 
75797 
81938 

95863 
51135 
57683 
77331 
29414 

95652 
79971 
10744 
36601 
97810 

09591 
85762 
48236 
79443 
62545 



29044 
06568 
25424 
45406 
82322 

20790 
98527 
30277 
60710 
06829 

42457 
54195 
08396 
46253 
36764 

07839 
64236 
16057 
95203 
21944 



4662 1 
21960 
11645 
31041 
96799 

65304 
62586 
94623 
52290 
87843 

73547 
25708 
56242 
00477 
32869 

58892 
39238 
81812 
02479 
16530 



21387 
55870 
86707 
85659 



85418 
16835 
28195 

76552 
51817 
90985 
25234 



92843 
18776 
15815 
30763 
03878 



93582 
76105 
56974 
12973 
36081 

00745 
25439 
68829 
48653 
27279 

50020 
36732 

09908 
55261 



84303 
63700 
92486 
07516 



37428 
17169 



06652 
71590 
47152 

24819 
72484 
99431 
36574 
59009 

91341 

99247 
85915 
54083 
95715 



19640 
97453 
93507 
88116 
14070 

11822 
24034 
41982 
16159 

35683 

52984 
94923 
50995 
72139 



46149 
19219 

23631 
02526 



87056 
90581 
94271 
42187 
74950 

15804 
67283 
49159 
14676 
47280 

76168 
75936 
20507 
70185 

38723 

63886 
03229 
45943 
05825 
33537 
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Abscissa, 4 

Absolute dispersion, 100 
Absolute value, 95 

Acceptance region, 247 (see also Hypotheses) 
Alternative hypothesis, 245 
Analysis of variance, 362-401 

mathematical model for, 403 

one-factor experiments using, 403 

purpose of, 403 

tables, 406 

two-factor experiments using, 407 

using Graeco-Latin squares, 413 

using Latin squares, 413 
Approximating curves, equations of, 317 
Areas: 

of chi-square distribution, 277 
of F distribution, 279 
of normal distribution, 173 
of t distribution, 275 
Arithmetic mean, 61 
assumed or guessed, 63 
Charlier's check for, 99 
coding method for computing, 63 
computed from grouped data, 63 
effect of extreme values on, 71 
for population and for sample, 144 
long and short methods for computing, 63 
of arithmetic means, 63 
probability distributions, 142 
properties of, 63 

relation to median and mode, 64 

relation to geometric and harmonic means, 66 

weighted, 62 
Arithmetic progression: 

moments of, 136 

variance of, 120 
Arrays, 37 

Assignable causes, 480 
Asymptotically normal, 204 
Attributes control chart, 48 1 
Attributes data, 481 
Attributes, correlation of, 298 
Autocorrelation, 351 
Average, 62 



Bar charts or graphs, component part, 4 

of common logarithms, 6 

of natural logarithms, 6 
Bayes' rule or theorem, 170 
Bernoulli distribution (see Binomial distribution) 
Best estimate, 228 
Biased estimates, 227 
Bimodal frequency curve, 41 
Binomial coefficients, 173 

Pascal's triangle for, 180 
Binomial distribution, 172 

fitting of data, 195 

properties of, 172 

relation to normal distribution, 174 

relation to Poisson distribution, 175 

tests of hypotheses using, 250 
Binomial expansion or formula, 173 
Bivariate: 

frequency distribution or table, 366 

normal distribution, 351 

population, 351 
Blocks, randomized, 413 

Categories, 37 
C-chart, 490 
Cell frequencies, 496 
Center of gravity, 320 
Centerline, 480 
Central limit theorem, 204 
Centroid, 320 
Charlier's check, 99 

for mean and variance, 99 

for moments, 124 
Chi-square, 294 

additive property of, 299 

definition of, 294 

for goodness of fit, 295 

formulas for in contingency tables, 297 

test, 295 

Yates' correction for, 297 
Chi-square distribution, 297 (see also Chi-square) 
confidence intervals using, 278 
tests of hypothesis and significance. 294 
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INDEX 



Circular graph (see Pie graph) 
Class, 37 (see also Class intervals) 
Class boundaries, lower and upper, 38 
Class frequency, 37 
Class intervals, 38 

median, 64 

modal, 43 

open, 38 

unequal, 51 

width or size of, 38 
Class length, size or width, 38 
Class limits, 38 

lower and upper, 38 

true, 38 
Class mark, 38 
Coding methods, 63 

for correlation coefficient, 350 

for mean, 63 

for moment, 124 

for standard deviation, 98 
Combinations, 146 
Combinatorial analysis. 146 
Common causes, 480 
Common logarithms, 6 
Complete randomization, 413 
Compound event, 143 
Compound interest formula, 85 
Computations, 3 

rules for, 3 

rules for, using logarithms, 7 

Conditional probability, 140 

Confidence coefficients, 228 

Confidence interval: 

for differences and sums, 230 

for means, 229 

for proportions, 229 

for standard deviations, 230 

in correlation and regression, 374 

using chi-square distribution, 278 

using normal distribution, 229-230 

using t distribution, 276 

Confidence levels, table of, 229 

Confidence limits, 229 

Constant, 1 

Contingency tables, 296 

correlation coefficient from, 298 

formulas for chi-square in, 298 
Contingency, coefficient of, 310 
Continuous data, 1 

graphical representation of, 57 
Continuous probability distributions, 143 
Continuous variable, 1 
Control charts, 480 
Control limits, 480 
Coordinates, rectangular, 4 
Correlation, 345 

auto-, 351 



Correlation (Contd.): 

coefficient of (see Correlation coefficient) 
linear, 345 
measures of, 346 
of attributes, 311 
partial, 385 

positive and negative, 345 

rank, 450 

simple, 345 

spurious, 549 
Correlation coefficient, 348 

for grouped data, 350 

from contingency tables, 298 

product-moment formula for, 350 

regression lines and, 351 

sampling theory and, 351 
Correlation table, 350 

Counting method in Mann-Whitney U test, 447 

Countings or enumerations, 2 

CP index, 485 

CPK index, 485 

Critical region, 247 

Critical values, 248 

Cubic curve, 317 

Cumulative frequency, 40 

Cumulative probability distributions, 143 

Cumulative rounding errors, 2 

Curve fitting, 316 

freehand method of, 318 

least-squares method of, 319 
Cusum chart, 490 
Cyclic pattern in runs test, 449 

Data: 

grouped, 38 
raw, 37 

spread or variation of, 89 
Deciles, 66 

from grouped data, 87 

standard errors for, 206 
Decision rules, 246 (sec also Statistical decisions) 
Deductive statistics, 1 
Defective item, 487 
Defects, 487 

Degrees of freedom, 276 
Density function, 144 
Dependent events, 140 
Dependent variable, 4 

change of in regression equation, 384 
Descriptive statistics, 1 
Design of experiments, 412 
Determination: 

coefficient of, 348 

coefficient of multiple, 384 
Deviation from arithmetic mean, 63 
Diagrams (see Graphs) 
Dimensionless moments, 124 



INDEX 
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Discrete data, 2 

graphical representation of, 54 
Discrete probability distributions, 142 
Discrete variable, 1 
Dispersion, 87 (see also Variation) 

absolute, 95 

coefficient of, 100 

measures of, 95 

relative, 100 
Distribution function, 142 
Domain of variable, 1 

Efficient estimates and estimators, 228 
Empirical probability, 139 

Empirical relation between mean, median, and mode, 
64 

Empirical relation between measures of dispersion, 
100 

Enumerations, 2 
Equations, 5 
equivalent, 5 

left and right hand members of, 5 

of approximating curves, 317 

quadratic, 35 

regression, 382 

simultaneous, 25, 34 

solution of, 5 

transposition in, 26 
Errors: 

grouping, 39 

probable, 230 

rounding, 2 
Estimates (see also Estimation) 

biased and unbiased, 227 

efficient and inefficient, 228 

point and interval, 228 
Estimation, 227 
Euler diagram, 146 
Events, 140 

compound, 140 

dependent, 140 

independent, 140 

mutually exclusive, 141 
EWMA chart, 490 
Exact or small sampling theory, 181 
Expectation, mathematical, 144 
Expected or theoretical frequencies, 294 
Experimental design, 412 
Explained variation, 348 
Exponent, 2 
Exponential curve, 318 

F distribution, 279 (see also Analysis of 

variance) 
Factorial, 143 

Stirling's formula for, 146 



Failure, 139 

Fitting of data, 19 (see also Curve fitting) 

by binomial distribution, 195 

by normal distribution, 196 

by Poisson distribution, 198 

using probability graph paper, 177 
Four-dimensional space, 385 
Freehand method of curve fitting, 3 1 8 
Frequency (see also Class frequency) 

cumulative, 40 

modal, 41 

relative, 40 
Frequency curves, 41 

relative, 41 

types of, 41 
Frequency distributions, 37 

cumulative, 40 

percentage or relative, 39 

rule for forming, 38 
Frequency function, 142 
Frequency polygons, 39 

percentage or relative, 39 

smoothed, 41 
Frequency table (see also Frequency 
distributions) 

cumulative, 40 

relative, 40 
Function, 4 

distribution, 142 

frequency, 142 

linear, 318 

multiple-valued, 4 

probability, 139 

single-valued, 4 

Geometric curve, 318 
Geometric mean, 65 

from grouped data, 83 

relation to arithmetic and harmonic mean, 
66 

suitability for averaging ratios, 84 
weighted, 84 
Gompertz curve, 318 

Goodness of fit test, 177 (see also Fitting of data) 
Gossett, 276 

Graeco-Latin squares, 413 
Grand mean, 404 
Graph, 5 

bar charts, 18 

pie, 5 
Graph paper: 

log-log, 339 

probability, 177 

semilog, 318 
Group mean, 404 
Grouped data, 38 
Grouping error, 39 
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H statistic, 448 
Harmonic mean, 65 

relation to arithmetic and geometric means, 66 

weighted, 86 
Histograms, 39 

computing medians for, 64 

percentage or relative frequency, 39 

probability, 153 
Homogeneity of variance test, 285 
Hyperbola, 318 
Hyperplane, 385 
Hypotheses, 245 

alternative, 245 

null, 245 

Identity, 5 

Independent events, 140 

Independent variable, 4 

Individuals chart, 489 

Inductive statistics, 1 

Inefficient estimates and estimators, 228 

Inequalities, 5 

Inequality symbols, 5 

Inspection unit, 489 

Interaction, 411 

Interaction plot, 429 

Intercepts, 319 

Interest, compound, 85 

Interpolation, 7 

Interquartile range, 96 

semi-, 96 
Intersection of sets, 146 
Interval estimates. 228 

Kruskal-Wallis H test, 448 
Kurtosis, 125 

moment coefficient of, 125 

of binomial distribution, 173 

of normal distribution, 125 

of Poisson distribution, 176 

percentile coefficient of, 125 

Latin squares, 413 

orthogonal, 413 
Least squares: 

curve, 319 

line, 319 

parabola, 320 

plane, 321 
Leptokurtic, 125 
Level of significance, 246 
1 ogarithms. 6 

base of, 6 

common, 6 

computations using, 7 

natural, 6 
Logistic curve, 318 



Lower capability index, 485 
Lower control limit, 480 
Lower specification limit, 484 

Main effects plot, 429 
Mann-Whitney U test, 454 
Marginal frequencies, 294 
Mean deviation, 95 

for grouped data, 96 

of normal distribution, 174 
Measurements, 2 
Measures of central tendency, 61 
Median, 64 

computed from histogram or percentage ogive, 64 
effect of extreme values on, 78 
for grouped data, 64 

relation to arithmetic mean and mode, 64 
Median chart, 489 
Mesokurtic, 125 
Modal class, 43 
Mode, 64 

for grouped data, 64 

formula for, 64 

relation to arithmetic mean and median, 65 
Model or theoretical distribution, 177 
Moment coefficient of kurtosis, 125 
Moment coefficient of skewness, 125 
Moments, 123 

Charlier's check for computing, 124 

coding method for computing, 124 

definition of, 123 

dimensionless, 124 

for grouped data, 123 

relations between, 124 

Sheppard's corrections for, 124 
Multimodal frequency curve, 41 
Multinomial distribution, 177 
Multinomial expansion, 177 
Multiple correlation, 382 
Multiple determination, coefficient of. 384 
Mutually exclusive events, 141 

Natural logarithms, base of, 6 
Nonconformance rates, 484 
Nonconforming item, 486 
Nonlinear: 

correlation and regression, 346 

equations reducible to linear form, 320 

multiple regression, 382 

relationship between variables, 316 
Nonparametric tests, 446 

for correlation, 450 

Kruskal-Wallis H test, 448 

Mann-Whitney U test, 447 

runs test, 449 

sign test, 440 
Nonrandom, 449 
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Nonsense correlation, 350 

Normal approximation to the binomial distribution, 
174 

Normal curve, 173 

areas under, 173 

ordinates of, 187 

standard form of, 173 
Normal distribution, 173 

proportions of, 174 

relation to binomial distribution, 174 

relation to Poisson distribution, 176 

standard form of, 174 
Normal equations: 

for least-squares line, 320 

for least-squares parabola. 320 

for least-squares plane, 321 
Normality test, 196 
iVP-chart, 487 
nth degree curve, 317 
Null hypothesis, 245 
Null set, 146 

Observed frequencies, 294 

OC curves (see Operating characteristic curve) 

Ogives, 40 

deciles, percentiles, and quarliles obtained from, 87 

"less than", 52 

median obtained from, 78 

"or more", 52 

percentage, 53 

smoothed, 53 
One-factor experiments, 403 
One-sided or one-tailed tests, 247 
One-way classification, 403 
"Or more" cumulative distribution, 40 
Ordinates, 4 
Origin, 5 

Orthogonal Latin square, 413 
Parabola, 320 

Parameters, estimation of, 227 

Pareto chart, 504 

Partial correlation, 382 

Partial regression coefficients, 382 

Parts per million (ppm), 484 

Pascal's triangle, 180 

P-chart, 487 

Pearson's coefficients of skewness, 125 
Percentage: 

cumulative distributions, 39 

cumulative frequency, 39 

distribution, 39 

histogram, 38 

ogives, 39 
Percentile coefficient of kurtosis. 125 
Percentile range, 96 
Percentiles, 66 



Permutations, 145 
Pie graph or chart, 5 
Plane, 4 
Platykurtic, 125 
Point estimates, 228 
Poisson distribution, 175 

fitting of data by, 198 

properties of, 175 

relation to binomial and normal distribution, 
176 

Polynomials, 318 
Population, 1 

Population parameters, 228 
Positive correlation, 346 
Probability, 139 

axiomatic, 140 

classic definition of, 139 

combinatorial analysis and, 143 

conditional, 140 

distributions, 142 

empirical, 140 

fundamental rules of, 146 

relation to point set theory, 146 

relative frequency definition of, 140 
Probability function, 142 
Probable error, 230 
Process capability index, 485 
Process spread, 484 

Product-moment formula for correlation coefficient, 
350 

Proportions, 205 

confidence interval for, 229 

sampling distribution of, 205 

tests of hypothesis for, 245 
p-value, 248 

Quadrants, 4 
Quadratic: 

curve, 317 

equation, 35 

function, 17 
Quantiles, 66 
Quartic curve, 317 

Quartile coefficient of relative dispersion, 116 
Quartile coefficient of skewness, 125 
Quartiles, 66 
Quintiles, 87 

Random: 

errors, 403 

numbers, 203 

sample, 203 

variable, 142 
Randomization, complete, 413 
Randomized blocks, 413 
Range, 95 

10-90 percentile, 96 
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Range (Contd.): 

interquartile, 96 

semi-interquartile, 96 
Rank correlation, coefficient of, 450 
Raw data, 37 
Regression, 321 

curve, 321 

line, 321 

multiple, 345 

plane, 321 

sampling theory of, 352 

simple, 343 

surface, 322 
Relative dispersion or variance, 95 
Relative frequency, 39 

curves, 39 

definition of probability, 139 

distribution, 39 

table, 39 
Residual, 319 
Residual variation, 409 
Reverse J-shaped distribution, 41 
Root mean square, 66 
Rounding errors, 2 
Rounding of data, 2 
Row means, 403 
Runs, 449 

Sample, 1 
Sample space, 146 
Sample statistics, 203 
Sampling: 

with replacement, 204 

without replacement, 204 
Sampling distributions, 204 

experimental, 208 

of differences and sums, 205 

of means, 204 

of proportions, 205 

of various statistics, 204 
Sampling numbers, 213 
Sampling theory, 203 

large samples, 207 

of correlation, 351 

of regression, 352 

use of in estimation, 227 

use of in tests of hypotheses and significance, 
245 

Scientific notation, 2 
Semi-interquartile range, 96 
Semilog graph paper, 318 
Sheppard's correction for moments, 124 
Sheppard's correction for variance, 100 
Sign test, 446 

Significant digits or figures, 3 
Simple correlation, 382 
Simultaneous equations, 5 



Single-valued functions, 4 
Skewed frequency curves, 41 
Skewness, 41 

10-90 percentile coefficient of, 125 

binomial distribution, 172 

negative, 41 

normal distribution. 173 

Pearson's coefficients of, 125 

Poisson distribution, 175 

quartile coefficient of, 125 
Slope of a line, 318 
Small sampling theory, 275 
Solution of equations, 5 
Spearman's formula for rank correlation, 450 
Special causes, 480 
Specification limits, 484 
Spurious correlation, 349 
Standard deviation, 96 

coding method for, 98 

confidence interval for, 230 

from grouped data, 98 

minimal property of, 98 

of probability distribution, 155 

of sampling distributions, 204 

properties of, 98 

relation of population and sample, 97 

relation to mean deviation, 100 

short methods for computing, 98 
Standard error of estimate, 348 

modified, 348 
Standard errors of sampling distributions, 206 
Standard scores, 101 
Standardized variable, 101 
Statistical decisions, 245 
Statistics, 1 

deductive or descriptive, 1 

definition of, 1 

inductive, 1 

sample, 1 
Stirling's approximation to n\, 146 
Straight line, 317 

equation of, 317 

least-squares, 318 

regression, 321 

slope of, 318 
Subgroups, 481 
Subscript notation, 61 
Success, 139 
Summation notation, 61 
Symmetric or bell-shaped curve, 41 

t distribution, 275 
t score or statistic, 275 
Table, 4 
Table entry, 42 
Tally sheet, 39 

Ten to ninety percentile range, 96 
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Test statistic, 247 
for special causes, 484 

for differences of means and proportions, 249 
for means and proportions, 249 
involving the binomial distribution, 250 
involving the normal distribution, 246 
of hypotheses and significance, 245 
relating to correlation and regression, 352 

Tetrachoric correlation, 299 

Theoretical frequencies, 295 

Ties, in the Kruskal-Wallis H test, 448 
in the Mann-Whitney U test, 447 

Total variation, 348 

Treatment, 345 

Treatment means, 345 

Two-factor experiment, 407 

Two-sided or two-tailed test, 247 

Two-way classifications, 407 

Type I and Type II errors, 246 

[/-chart, 490 

U-shaped frequency curves, 41 
U statistic, 447 
Unbiased estimates, 227 
Unexplained variation. 348 
Unimodal distribution, 64 
Union of sets, 146 
Upper capability index, 485 
Upper control limit, 480 
Upper specification limit, 484 

Variable, 1 

dependent, 4 
discrete, 1 
domain of, 1 
independent, 4 
normally distributed, 173 
standardized, 101 



Variables: 

control chart. 480 
data, 480 

relationship between, 316 (see also Curve 
fitting; Correlation; Regression) 
Variance, 97 (see also Standard deviation) 

Charlier's check for, 99 

combined or pooled, 98 

modified sample, 227 

of probability distribution, 155 

of sampling distributions, 204 

relation between population and sample, 144 

Sheppard's correction for, 100 
Variation, 95 (see also Dispersion) 

coefficient of, 100 

explained and unexplained, 348 

quartile coefficient of, 116 

residual, 408 

total, 348 
Venn diagram (see Euler diagram) 

Weighted: 

arithmetic mean, 62 

geometric mean, 83 

harmonic mean, 85 
Weighting factors. 62 

X and Y axes of rectangular coordinate system, 4 
X-bar and R charts, 48 1 
Z-intercept, 283, 289-291 
AT-plane, 4 

y-axis, 4 

Yates' correction for continuity, 297 

Zero-order correlation coefficient, 383 
Zero point, 4 
Zone chart, 489 



